
New open source tech Marathon wants to make your data center run like Google’s - tknaup
http://gigaom.com/2013/09/04/new-open-source-tech-marathon-wants-to-make-your-data-center-run-like-googles/
======
contingencies
_at a broader level, projects like Marathon tie into the greater move toward
software-defined networks, storage and even data centers. Companies are trying
to replace expensive gear with commodity gear powered by smart software, and
being able to automate cluster management and failover is certainly part of
that equation._

At least this part is spot on.

They _seem_ to be focused on ZooKeeper (both are Apache projects). I'd very
much like to see a decent comparison of Pacemaker/Corosync @
[http://clusterlabs.org/wiki/Main_Page](http://clusterlabs.org/wiki/Main_Page),
ZooKeeper @ [https://zookeeper.apache.org/](https://zookeeper.apache.org/),
OpenReplica @ [http://openreplica.org/](http://openreplica.org/),
particularly given that the first is what I am most familiar with and what
larger Linux businesses (Red Hat, etc.) seem to be focused on.

------
electic
Won't Docker be able to take this type of efficiency to a whole new level?

~~~
WestCoastJustin
No. Docker just supports provisioning containers. Take a high-level view of a
data centre with 10,000 machines: you need orchestration software that knows
about and automates these 10k machines: which ones are online/offline,
utilization, storage, networking, current state, power distribution (what
happens if a rack's power drops out?), etc. It then provisions these
containers onto that hardware. Think of this orchestration software as a
brain: it knows the current state of things, keeps watch on what is happening
in the data centre, fixes things when they go bad, and knows where to put
things when you want to do something.

    
    
      Docker --> hardware
      Orchestration software --> Docker/LXC/VM/etc --> hardware
    

An additional example would be dotCloud (makers of Docker). They have
orchestration software sitting atop Docker, which knows about users, machines,
etc. and then provisions these Docker instances on AWS hardware.

    
    
       dotCloud (orchestration software) --> Docker --> AWS EC2 (hardware)
    

There is a great Wired article (linked below [1] and in the OP's story),
which outlines how Google uses this orchestration software in its day-to-day
operations. There is also a great diagram [2], which shows Omega (Google's
orchestration software) and how it deploys containers for images, search,
Gmail, etc. onto the same physical hardware. There is an amazing talk by John
Wilkes (Google Cluster Management, Mountain View) about Omega at Google
Faculty Summit 2011 [3]. I would highly recommend watching it!

P.S. Orchestration software has a global view of resources across all machines,
so it knows how to get the best utilization out of all these machines (think
jigsaw puzzle). You submit a container profile to the orchestration software,
specifying things like instance lifetime, CPU, storage, memory, and
redundancy, and the software will figure out where to place your instances.
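
As a rough sketch of that placement step (this is not how Omega or any real scheduler is implemented; `Machine`, `Profile` and `place` are made-up names for illustration), a best-fit placer that puts each instance on a different machine for redundancy could look like this:

```python
from dataclasses import dataclass, field


@dataclass
class Machine:
    name: str
    cpus: float          # free CPU cores
    mem_gb: float        # free memory in GB
    placed: list = field(default_factory=list)  # profile names hosted here


@dataclass
class Profile:
    name: str
    cpus: float
    mem_gb: float
    instances: int = 1   # redundancy: how many copies to run


def place(profile, machines):
    """Place every instance of `profile`, one per machine, best fit first."""
    assignments = []
    for _ in range(profile.instances):
        # A machine qualifies if it has room and does not already host a copy.
        candidates = [
            m for m in machines
            if m.cpus >= profile.cpus
            and m.mem_gb >= profile.mem_gb
            and profile.name not in m.placed
        ]
        if not candidates:
            raise RuntimeError(f"no room left for {profile.name}")
        # Best fit: pick the machine that would have the least left over.
        target = min(
            candidates,
            key=lambda m: (m.cpus - profile.cpus, m.mem_gb - profile.mem_gb),
        )
        target.cpus -= profile.cpus
        target.mem_gb -= profile.mem_gb
        target.placed.append(profile.name)
        assignments.append(target.name)
    return assignments
```

A real scheduler would also weigh things like rack and power domains, priorities and preemption; this only shows the "jigsaw puzzle" part.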

[1] [http://www.wired.com/wiredenterprise/2013/03/google-borg-
twi...](http://www.wired.com/wiredenterprise/2013/03/google-borg-twitter-
mesos/all/)

[2] [http://www.wired.com/wiredenterprise/wp-
content/uploads/2013...](http://www.wired.com/wiredenterprise/wp-
content/uploads/2013/03/google-omega-info.gif)

[3]
[http://www.youtube.com/watch?v=0ZFMlO98Jkc](http://www.youtube.com/watch?v=0ZFMlO98Jkc)

~~~
electic
Thanks! That clears things up. I appreciate that.

P.S. - Not to Justin but others. Funny, you ask an honest question on here and
people downvote you...sigh.

~~~
wmf
I think Docker is great but it's been so overexposed on HN that there's a bit
of a backlash.

------
peterwwillis
_"Marathon launches two instances of the Chronos scheduler as a Marathon
task. If either of the two Chronos tasks dies -- due to underlying slave
crashes, power loss in the cluster, etc. -- Marathon will re-start a Chronos
instance on another slave. This approach ensures that two Chronos processes
are always running._"

So basically, it's a distributed crond. I could be wrong, but I don't think
this will make your datacenter run like Google's.

~~~
necubi
Really, it's Mesos that "makes your datacenter run like Google's," while
Marathon is a framework built on top of it that provides distributed service
management (think distributed upstart). Mesos is in part based on ideas from
Google's Borg cluster management system.

The basic idea is to provide the abstraction of a single machine to
developers, who can say things like: I want to run this program on 40 cores
with 100GB of memory, and the system will find those resources across the
cluster, create a cgroup with your resource limits, and start your services.
It also provides various tools for writing distributed services, like an
implementation of Paxos.
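
To make the "40 cores with 100GB" request concrete, here is a minimal sketch of gathering resources across machines until the request is covered. It is not the Mesos API; the `allocate` function and the `(host, cpus, mem_gb)` offer tuples are assumptions for illustration. Each returned grant is what you would then turn into a per-host resource limit.

```python
def allocate(request_cpus, request_mem_gb, offers):
    """Walk the offers in order and take what is needed from each host.

    offers: list of (host, free_cpus, free_mem_gb) tuples.
    Returns a list of (host, cpus, mem_gb) grants covering the request.
    """
    grants = []
    need_cpus, need_mem = request_cpus, request_mem_gb
    for host, cpus, mem in offers:
        if need_cpus <= 0 and need_mem <= 0:
            break  # request already satisfied
        take_cpus = min(cpus, max(need_cpus, 0))
        take_mem = min(mem, max(need_mem, 0))
        if take_cpus == 0 and take_mem == 0:
            continue  # this host has nothing we still need
        grants.append((host, take_cpus, take_mem))
        need_cpus -= take_cpus
        need_mem -= take_mem
    if need_cpus > 0 or need_mem > 0:
        raise RuntimeError("cluster cannot satisfy the request")
    return grants
```

For example, with three hosts offering 16 cores each, a 40-core, 100GB request ends up spread over all three, and the developer never has to name a machine.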

~~~
peterwwillis
But Mesos already provides distributed service management. Why is Marathon
doing it too? And why do you need Chronos if Marathon performs the same
functions?

~~~
florianleibert
Chronos is the "cron" for the cluster. Marathon is the "upstart" for your
cluster. You can start Chronos via Marathon.
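
A toy sketch of the difference, under the assumption that "upstart" means keeping a target number of long-running tasks alive and "cron" means firing jobs on a schedule. Neither function is the real Marathon or Chronos API; `reconcile` and `due_jobs` are hypothetical names.

```python
def reconcile(desired_count, running, live_slaves):
    """Upstart-like: restore the desired number of long-running tasks.

    running: dict of task id -> slave name.
    live_slaves: set of slave names that are currently up.
    """
    # Tasks on dead slaves are gone.
    survivors = {t: s for t, s in running.items() if s in live_slaves}
    next_id = 0
    while len(survivors) < desired_count:
        used = set(survivors.values())
        # Prefer a slave that is not already hosting an instance.
        free = sorted(live_slaves - used) or sorted(live_slaves)
        if not free:
            raise RuntimeError("no live slaves")
        while f"task-{next_id}" in survivors:
            next_id += 1
        survivors[f"task-{next_id}"] = free[0]
    return survivors


def due_jobs(jobs, now, last_run):
    """Cron-like: return the names of jobs whose interval has elapsed.

    jobs: dict of job name -> interval in seconds.
    last_run: dict of job name -> timestamp of the last run.
    """
    return sorted(
        name for name, interval in jobs.items()
        if now - last_run.get(name, float("-inf")) >= interval
    )
```

Running the scheduler that calls `due_jobs` as one of the tasks kept alive by `reconcile` is the "start Chronos via Marathon" arrangement.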

~~~
peterwwillis
Yes. I get that. Except Marathon can do 'cron-like' things too. So can Mesos.

Mesos manages cluster applications. So does Marathon.

Mesos can juggle task resources. So does Marathon.

Mesos is a framework. So is Marathon.

Mesos has a scheduler. So does Marathon.

The only thing Chronos does, at all, is run jobs on a schedule, in a
distributed fashion. Which Marathon and Mesos can do too.

So can someone please explain to me _why_ you need all three??

\--

Look, here, at the infographics in the middle of the page:
[https://github.com/mesosphere/marathon](https://github.com/mesosphere/marathon)

What they're saying is, Marathon will move your jobs to a new server when one
dies. Okay, cool. But Chronos can do that too. And Mesos can do that too!

\--

I'm pretty sure all these tools are a giant troll by Google to get its
competitors to burn R&D time on reinventing tools that already exist and
aren't necessary.

~~~
necubi
If you think these tools aren't necessary, you probably haven't managed a
cluster with thousands of machines and hundreds of users. I imagine that every
company in that situation has an ad-hoc implementation of distributed cron and
distributed upstart (I know we do).

Without something like Mesos, you generally will run different things on your
cluster by statically partitioning it (these ten racks run Hadoop, this rack
runs our website, this rack runs Spark because some engineers wanted to try
that out, etc.) or by running everything together (typically done with your
distributed file system, but can be problematic with more compute-oriented
services).

The Mesos approach is to stop thinking of your cluster on a machine-by-machine
or rack-by-rack basis and instead treat it as just a giant pool of resources. It's a
very powerful abstraction that greatly increases the number of machines and
developers that are manageable.
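
A back-of-the-envelope way to see why pooling helps, with entirely made-up demand numbers: static partitions each have to be sized for their own peak, while a shared pool only has to be sized for the peak of the combined load.

```python
def static_capacity(demand):
    """Machines needed if every service gets its own partition sized for its peak."""
    return sum(max(series) for series in demand.values())


def pooled_capacity(demand):
    """Machines needed if all services share one pool sized for the combined peak."""
    return max(sum(step) for step in zip(*demand.values()))


# Hypothetical machines-in-use per service over three time periods.
demand = {
    "hadoop":  [80, 20, 10],
    "website": [10, 40, 15],
    "spark":   [5, 5, 30],
}

print(static_capacity(demand))  # 150
print(pooled_capacity(demand))  # 95
```

The gap only exists when the peaks do not coincide, and it ignores isolation overhead, so treat it as an upper bound on the saving rather than a prediction.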

~~~
peterwwillis
I'm familiar with the concept behind it. My problem was with how they all seem
to do the same things, and nobody yet has pointed this out; everyone just
accepts the fact that they're mostly redundant and moves on.

I've managed SSI clusters, MPI clusters, and clusters of dumb app servers of
varying sizes (10 nodes to 10,000). If you really want just a giant pool of
resources, you can do much worse than an SSI cluster, but nobody wants to
spend time working on a hard problem, so instead we dick around with task-
shuffling job-runners inside the components that were written by the hardcore
programmers that work in the kernel. But I guess we do what we can with what
we have... (I blame Linus's team for not merging openMosix when they had the
chance!)

------
lambda
So, how does this compare to the Corosync/Pacemaker cluster stack?

~~~
necubi
I'm not familiar with those tools, but from the clusterlabs website it looks
like they're solving a different problem. Mesos is primarily intended for use on
clusters with thousands of machines and heterogeneous application needs. It
provides resource isolation and scheduling as well as some components that
make writing distributed systems easier.

------
icecreampain
Skimmed through the article and I didn't find any evidence of "making your
data center run like Google's". More specifically: I didn't see any live data
connections to the NSA.

~~~
necubi
What value could this possibly add to the discussion?

