

Return of the Borg: How Twitter Rebuilt Google’s Secret Weapon - smacktoward
http://www.wired.com/wiredenterprise/2013/03/google-borg-twitter-mesos/all/

======
fourspace
As mentioned in another comment, I managed a Google SRE team that helped run
Borg for 5 years. I'm good friends with several of the folks who are
rebuilding it using Mesos at Twitter. They are all insanely awesome people.

Let's say you have a Rails stack (app server + DB) that you want to deploy for
testing in 3 different datacenters. If that works out, you then immediately
need to deploy 10,000 instances each to 10 different datacenters. Oh yeah, and
the storage needs to be distributed and universally available, in case any of
the servers crash. Performance testing reveals that you can't have more than
10 servers per rack, otherwise you saturate the rack switch. You also need to
account for power distribution redundancy, shifting traffic loads, etc.

Oh, and you want to do this with a single configuration that's manageable by a
team of 3-4 people and have deployment be entirely automated and monitored.
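
To make those constraints concrete, here's a toy sketch (hypothetical names,
not Borg's actual config or API) of what a single placement check for that
scenario might look like:

    # Toy sketch of the placement constraints above; all names are
    # illustrative, not Borg/Mesos APIs.
    from collections import Counter
    from dataclasses import dataclass

    MAX_PER_RACK = 10  # from the example: more saturates the rack switch

    @dataclass(frozen=True)
    class Machine:
        name: str
        rack: str
        power_domain: str

    @dataclass(frozen=True)
    class Task:
        job: str
        index: int

    def can_place(task, machine, placements):
        """Would placing `task` on `machine` break rack or power rules?"""
        per_rack = Counter(m.rack for m in placements.values())
        if per_rack[machine.rack] >= MAX_PER_RACK:
            return False  # rack switch would saturate
        # Keep replicas of the same job off a shared power domain.
        if any(t.job == task.job and m.power_domain == machine.power_domain
               for t, m in placements.items()):
            return False
        return True

And that's just two constraints on one placement decision, before you get to
making 100,000 of those decisions continuously as conditions change.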

The complexities behind this problem are simply enormous, almost too much to
even comprehend. I am proud of the work we did at Google to attempt (yes,
attempt) to solve this problem, and I know my friends at Twitter are doing
great work as well. I'm mostly just happy that I can finally talk about it and
that the badasses that do this work can get some credit.

~~~
sneak
I think that in the end, Google's greatest gift to the world will be all the
experience and knowledge they put into the minds of the world's best engineers
who worked on big stuff there over the last ten years or so.

Those people are building truly amazing things now based on that (for lack of
a better term) incubator for the wider industry.

The stuff that has been built directly by Google is tremendous, to be sure.
But the fundamental understandings that have come out of their R&D to do it
will benefit everyone across the entire industry. It's magical to watch.

------
ChuckMcM
I had dreams of working with the Borg team when I was at Google, and like
Jean-Luc Picard, I sometimes wake screaming from those dreams :-)

I have a huge amount of respect for the folks in SRE who would juggle clusters
like you and I might adjust thermostats around the office. It was not a job
for the faint of heart, and personally I don't think the people who deployed
on it appreciated it as much as they should have. But that is always somewhat
true of the operations side of the house: you know you're doing OK when nobody
is screaming at you.

~~~
jamesaguilar
Speaking of thermostats, that reminds me of one time on my previous team where
I started up too many image processing workers and basically saturated a
colo's worth of CPU. It got to the point where the datacenter's temperature
exceeded some alert threshold and I had SREs phoning me and asking what I had
done.

Man, working at Google is just a different kind of experience in some ways.
Not necessarily better, though I like it a lot, but definitely different.

------
vicaya
The article is content free for people who are actually interested in resource
management. I'm curious about the cluster utilization on these Google and
Twitter clusters. I know Yahoo clusters had/have embarrassingly low
utilization.

AFAICT, Omega and Mesos (and YARN, for what it's worth) cannot really handle
resource overcommit effectively. I wonder what they'll call it when they
reinvent a better DRS :)

~~~
jmillikin
That depends on how you measure utilization.

A naive approach is to measure idle CPU or RAM not allocated to a process.
Then improving utilization metrics is simply a measure of fitting more
computation onto one machine until there's not enough CPU time or RAM to fit
any more.
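
As a toy illustration (my own sketch, not any metric Google actually uses),
the naive version is just:

    # Naive allocation-based utilization: fraction of a machine's
    # resources reserved by processes. Illustrative formula only.
    def allocation_utilization(alloc_cpu, total_cpu, alloc_ram, total_ram):
        return max(alloc_cpu / total_cpu, alloc_ram / total_ram)

    # A 32-core, 128 GB machine running one latency-sensitive server
    # that reserves 8 cores and 16 GB looks "75% idle" by this measure,
    # even if the spare headroom is exactly what keeps tail latency low.
    print(allocation_utilization(8, 32, 16, 128))  # -> 0.25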

This works fine for throughput-oriented workloads, but will cause immediate
conflicts when imposed on teams who value low latency (e.g. the Search team at
Google). So you end up in a situation where a team is being pressured to
improve utilization at the expense of metrics that they care more about, and
they refuse, and that's when you get grumbling around the water cooler about
"wasteful propellerheads squandering expensive hardware!" / "ignorant
beancounters micromanaging things they don't understand!".

------
sp332
It might not be as obvious, but I think Omega is also a Star Trek reference.
<http://en.memory-alpha.org/wiki/Omega_Directive>

~~~
alan_cx
Must be. It makes sense too. The Omega particle was sort of a holy grail for
the Borg, as well as known to and sought by the Federation.

~~~
mitchty
Known, sure, but sought is pushing it considering how adamant they were about
destroying it.

------
jzelinskie
Honestly, I'm continually impressed by posts describing the internal
technology powering Google. Having an abstraction layer like this for global
deployment couldn't be more convenient. With the news about Facebook the other
day, I watched a talk on OpenFlow
(<https://www.youtube.com/watch?v=VLHJUfgxEO4>), and the fact that they have a
test suite they can plug into continuous integration to see whether a commit
breaks the application under any network condition is incredible. I'd love to
sign an NDA and just talk to the teams working on these projects for hours.
Truly impressive work.

------
austenallred
"Then, Hindman and his friends decided they should work on a project together
— if only because they liked each other. But they soon realized their two
areas of research — which seemed so different — were completely
complementary."

I see this as a great example of how innovation happens. Things you wouldn't
expect to fit together end up combining almost by sheer chance, and out of
that comes something brilliant.

I also love that Hindman joined Twitter as an "Intern" after he was a
consultant.

------
23david
I keep thinking about Mesos and I don't get what the big deal is here. With
low-overhead virtualization like Linux containers combined with configuration
management tools like SaltStack, Ansible, Chef, or Puppet, it isn't a big deal
to have multiple properly configured applications running on individual
physical machines. And then distributing workloads out to those machines
isn't super hard if you use distributed frameworks or job managers such as
Gearman, Celery, or something built on top of ZeroMQ. Am I missing something
here?

You could even get all fancy by using openvswitch to configure your own
virtual network topology. It just doesn't seem so complicated what they're
trying to accomplish.

~~~
fourspace
I managed a Google SRE team that helped run Borg for 5 years, so I'm pretty
familiar with this. You can certainly run multiple application servers on one
machine using virtualization. What isn't easy is automating where the
applications are deployed, in an elastic, constantly shifting way that deals
with massive numbers of machines and constantly changing hardware.

Cheap hardware crashes. It crashes all the time. Furthermore, the
applications' own needs change all the time, depending on traffic curves and
processing demands. The dynamic needs of Google's (and presumably Twitter's)
completely heterogeneous application stacks don't lend themselves to simple
virtualization and off-the-shelf software. This is an incredibly tricky bin
packing problem that was never quite solved in my 5 years at Google.
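
To give a flavor of why (a toy sketch of my own, not Borg's actual
algorithm): even the simplest greedy packing has to fit along multiple
resource dimensions at once, and every crash or traffic shift forces the
packing to be redone incrementally.

    # Toy multi-dimensional first-fit bin packing; illustrative only.
    from dataclasses import dataclass, field

    @dataclass
    class Machine:
        cpu: float                 # cores free
        ram: float                 # GB free
        tasks: list = field(default_factory=list)

    def first_fit(tasks, machines):
        """Place (name, cpu, ram) demands on the first machine that fits."""
        unplaced = []
        for name, cpu, ram in tasks:
            for m in machines:
                if m.cpu >= cpu and m.ram >= ram:
                    m.cpu -= cpu
                    m.ram -= ram
                    m.tasks.append(name)
                    break
            else:
                unplaced.append(name)  # nothing fits: wait, preempt, or buy
        return unplaced

    machines = [Machine(cpu=32, ram=128) for _ in range(3)]
    pending = [("web-0", 8, 16), ("db-0", 16, 96), ("batch-0", 24, 32)]
    print(first_fit(pending, machines))  # -> [] here, but rarely at scale

Now add rack, power, and kernel-version constraints, priorities, preemption,
and demands that change while you're packing, and you have the real problem.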

I don't know a lot about the "distributed frameworks" you mentioned, so
perhaps they do this too. I kind of doubt it. If it were as easy as you think,
I'm sure my friends at Twitter would be using what you mentioned.

------
packetslave
If you want to get a sense of the scale at which systems like Borg (and
presumably Mesos) operate, take a look at the Google cluster dataset that was
released a while back: <https://code.google.com/p/googleclusterdata/>

"The trace represents 29 day's worth of cell information from May 2011, on a
cluster of about 11k machines"

A sample: over a 7-hour period, just that one cluster executed 3.5 million
unique tasks (roughly 140 task starts per second, sustained).

------
mbreese
How do these systems compare to traditional cluster task schedulers like PBS
and SGE? Is it that Mesos/Borg/etc. have tasks that are managed within the
program, as opposed to the PBS/SGE model, where you write separate programs
that can then be scripted for distribution? (Not counting MPI.)

Are these like the next generation of task schedulers, or are we talking about
a higher abstraction level?

------
ww520
I am not familiar with the Borg or Omega systems at Google, and resource
allocation optimization is usually NP-hard, but would a random allocation
approach be a good enough solution? It might not produce the optimal
allocation, but since things are changing all the time anyway, a simple
good-enough solution that adapts to the changes is certainly desirable.
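
There's actually theory backing this intuition: the classic "power of two
choices" result says that sampling a couple of random machines and picking
the less loaded one gets you near-balanced placement at a tiny fraction of
the cost of global optimization. A minimal sketch, with made-up names:

    # Toy comparison: pure random placement vs. best-of-two random choices.
    # Illustrative only; not how Omega/Borg/Mesos place tasks.
    import random

    def place(num_tasks, num_machines, choices):
        """Each task samples `choices` machines and takes the least
        loaded one; returns the maximum resulting machine load."""
        load = [0] * num_machines
        for _ in range(num_tasks):
            candidates = random.sample(range(num_machines), choices)
            best = min(candidates, key=lambda m: load[m])
            load[best] += 1
        return max(load)

    random.seed(0)
    n = 10_000
    print("1 random choice: max load =", place(n, n, 1))
    print("best of 2:       max load =", place(n, n, 2))  # markedly lower

A blind sampler still can't see job-specific constraints, though, so any real
system would have to layer constraint checks on top.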

~~~
jjm
With random allocation you need good entropy, and the distribution must be
very uniform; otherwise you get the dyno issue we saw with Heroku. I think
random would still be blind to the characteristics of various jobs, where even
the smallest timing optimization would yield some real improvement. Perhaps a
mixed approach, within a nearly similar priority range (+/- a priority level),
might be useful. Sounds like a good time to find some research on the topic.

------
nonane
Isn't Microsoft trying to build something similar with Azure? It'd be best
described as an 'operating system for the cloud'.

------
swah
I was expecting Mesos to be written in Java, but it's written in C++.

