
Why the data center needs an operating system - datascientist
http://radar.oreilly.com/2014/12/why-the-data-center-needs-an-operating-system.html
======
chuhnk
It's interesting and great to see this stuff take center stage more and more.
Those lucky enough to work at places like Twitter, Google, Facebook and other
large tech companies will have already seen how this kind of thing dominates
the datacenters there and has been at the core of their systems for many
years. People on the outside, though, are rarely exposed to this concept of
datacenter-scale computing aside from things like Hadoop, so the idea of
leveraging a datacenter-level API is hard to fathom. I believe it's still
going to be a while before this truly becomes mainstream. Really, we'll see a
repeat of the historical two-track shift: on one side we've previously
achieved compute as a service with EC2 and GCE, while developing platform as a
service on the other with Google App Engine and Heroku. These serve
different demographics, and we'll continue to see new levels of abstraction
emerge, providing SREs/devops engineers with Cluster as a Service and giving
developers/programmers microservice platforms for more flexibility.

Google SREs by last count were 1 engineer to 1000 machines. In 10 years the
common devops engineer at a 200 person startup will leverage the same number
of resources using layers of abstraction like Mesos and Kubernetes.

~~~
mattzito
> Google SREs by last count were 1 engineer to 1000 machines

That number does not seem particularly impressive, if it is accurate. Even
"traditional" well-run enterprise IT organizations are often at 1 admin/SRE
to 600-ish machines, so I have a hard time believing that Google, at their
scale and with their level of focus, only does ~2x better.

1 SRE to 5k machines, 10k machines, that makes more sense to me.

~~~
devonkim
I don't know which kinds of enterprise orgs you're envisioning, but when I
think about "traditional" enterprise IT orgs, I'm picturing companies in
healthcare, insurance, certain finance business units, non-profits, public
sector, and defense, not the kind of place that gets mentioned a lot here on
HN in a technical discussion. I've worked with a LOT of them and I'd be
surprised if the ratio of ops engineers to servers was anything better than
1:10 as a ballpark number. Network guy, storage guy, DB guy, VM guy,
automation guy, net-sec guy, system-sec guy... that's pretty typical for
maybe a 20-server rollout, and headcount grows sub-linearly with the number
of users (wild guess of O(n^1/2)). If you mean SRE-type roles specifically,
or count auto-scaled groups in EC2 / GCE as servers, perhaps that may be the
case, but my idea of a traditional enterprise IT org is one that views
automation with fear and hesitation, preferring to add more people before
trying to "disrupt" its business with automation tools, and so it will be
stuck somewhere around 1:30 at best.

Admittedly, I probably have a skewed impression of enterprise from
consulting, because who would pay for automation consulting at $$$/hr if they
were already doing it pretty well with existing resources?

~~~
WorldWideWayne
I think part of the difference is due to the fact that the businesses of
administering healthcare, insurance, financials and the like are infinitely
more complex than, say, serving up search results or 140-character micro blog
posts.

In my opinion, it's much easier to scale a single function (search or tweet)
than the kinds of tasks a healthcare company has to do, like, say, scanning
faxes from doctors, applying OCR, and properly placing them into a pharmacy
order system.

~~~
justincormack
The idea is that those services get commoditized, so the healthcare company
gets them off the shelf rather than hiring engineers to build them.

------
mattbee
The article's premise is a poor introduction to the project. Sure, reinvent
MOSIX if you want :) but don't pretend it'll serve more than a niche of a
niche.

Firstly, "distributed computing is the norm"? It's just not. Most businesses &
app authors will never need to care about ultra-distributed computing, with
all its problems and trade-offs. You can move faster with "local-only"
computing and scale vertically very cheaply compared to a few years ago - 4
dedicated CPUs + hundreds of gigabytes of RAM save programmer hours, and get
your problem solved faster.

For light scaling issues (compared to Google) Redis, MariaDB and other
abstractions over local files have some great options for future scaling, and
are well-trodden, obvious choices.

Secondly, who cares about "wasted" resources of a whole underutilised server
when reliable dedicated servers are so cheap, and in such plentiful supply?

Thirdly, "organizations must employ armies of people to manually configure and
maintain each individual application on each individual machine"? - in the 90s
maybe! Surely anyone with more than a few applications to worry about is on
board with some basic configuration management.

Twitter-size scaling is a "nice problem to have". For all but the best-funded
& most bullish companies, solve those problems only when you start to have
them.

(My bias: I run a managed service provider in the UK. We tend to help
customers scale vertically by shovelling server images around with minimal
downtime. We find "underused" dedicated server capacity at fixed monthly
costs is usually cheaper than chasing the phantom of "optimum" AWS usage.)

~~~
srean
> Secondly, who cares about "wasted" resources of a whole underutilised server
> when reliable dedicated servers are so cheap, and in such plentiful supply?

The people bankrolling Google / Facebook / Twitter's electricity bills seem to
care quite a bit.

There is another trope that gets repeated often (and this is not even
remotely directed at you, just a digression, hopefully somewhat on topic):
"performant language runtimes are an anachronism; a bog-slow language in
which a programmer can code fast is way more useful than any of the
performance bull crap". Typically the person repeating that would be a
webdev. However, in these large-scale scenarios, core infrastructural code
can save orders of magnitude more money in running costs than it saves in
days of software development. So yeah, at the interesting places, algorithms
and efficiency continue to matter. One reason Google always managed to stay
ahead is how successful it was in minimizing running costs.

~~~
lmm
At Google scale it matters, sure. But most companies aren't Google scale.
Writing your single-company 100-user CRUD webapp in Java or C++ "for
performance" is the ultimate false economy.

I used to work for a company that had a big Java app. We laughed at our
client who needed 60 Rails servers to deliver worse performance than our
single-instance app. But they probably saved more on dev costs than they spent
on servers.

~~~
presspot
A mid-market SaaS company might have a $1mm/mo Amazon bill. I think cutting
that in half would be meaningful to just about anybody.

------
aidenn0
This author talks about POSIX as if a bunch of people sat down and invented a
portable OS API. The reality is closer to this: a group of people at AT&T
created Unix, a number of other companies and universities modified Unix, and
then people sat down and said "how can we unify the APIs of the fragmented
Unix world?"

~~~
preillyme
The programming API layer is fragmented in the UNIX world. There's mostly only
ANSI C, X, and gcc, which exist everywhere, but it's still not trivial to port
between *NIXes. The kernel API is the most pointless fragmentation of them
all. It's addressed by POSIX and UDI
([http://www.project-udi.org/](http://www.project-udi.org/)), but the
differences between the most important UNIX kernels don't justify the
problems they cause.

~~~
davidgerard
Oh, we're completely cross-platform these days! That means it runs on both
RHEL _and_ Ubuntu.

~~~
preillyme
ha ha ha

~~~
davidgerard
ha ha only serious [http://www.catb.org/jargon/html/H/ha-ha-only-serious.html](http://www.catb.org/jargon/html/H/ha-ha-only-serious.html)

It's VAXocentrism for the 21st century.
[http://www.catb.org/jargon/html/V/vaxocentrism.html](http://www.catb.org/jargon/html/V/vaxocentrism.html)

~~~
derefr
Interesting to think that the assumptions of C as a language evolved mostly
against the VAX architecture. It seems that if the VAX doesn't make a
distinction, then C doesn't (tend to) have any concept of that distinction
either. Examples:

\- Pointer types are basically fungible in C (otherwise there would be no
void-pointer type).

\- There's no compile-time knowledge of the segment a pointer references, to
prevent you from dereferencing a pointer to an offset from segment A while
segment B is loaded (compare this to Rust's parameterization of Box types by
their allocator).

\- Struct padding is painful and tacked on.

\- "unsigned char" isn't the default even though it'd make much more sense
for it to be. (What "char"acter is negative? You can have a signed
byte/octet, but a _character_ is—in 1979, at least—basically an enum/sum
type.)

~~~
aidenn0
Actually C disallows converting function pointers to/from void pointers.

~~~
davidgerard
Can you get away with that in pre-ANSI C? I wouldn't be surprised.

~~~
aidenn0
For pre-ANSI C, most of these questions are answered by "it depends on the
compiler", which is why ANSI C happened. Of course the answer is still "it
depends on the compiler" for this particular behavior, since ANSI C doesn't
define what it does, just that it isn't portable.

~~~
davidgerard
I remember working on a Fujitsu box in 1993, which used some sort of weird
System V Unix with pre-ANSI C and weirdly abbreviated man pages in Engrish.
_(shudder)_ Yeah.

------
preillyme
Mesos doesn't run one big distributed app across servers in a cluster, but
rather interleaves the workloads on nodes in the cluster to maximize the
utilization of compute, memory, and I/O resources on the cluster. Mesos can
run on bare-metal clusters or on machines that have some level of
virtualization on them to provide some isolation between workloads running on
the same machine. Google did a lot of the original work to create control
groups, or cgroups, a lightweight resource-isolation mechanism for Linux that
is the basis for LXC containers.

------
florianleibert
In light of all the excitement around Docker & containers, it's important to
note that Mesos has been using cgroups / LXC (for isolation) since its
inception.

Mesos paper:
[http://people.csail.mit.edu/matei/papers/2011/nsdi_mesos.pdf](http://people.csail.mit.edu/matei/papers/2011/nsdi_mesos.pdf)

~~~
vertex-four
It also supports Docker, as of recent releases.

~~~
sciurus
For reference:

[https://mesosphere.com/2014/11/10/docker-on-mesos-with-marathon/](https://mesosphere.com/2014/11/10/docker-on-mesos-with-marathon/)

[https://mesosphere.com/2014/12/03/docker-on-mesos-with-chronos/](https://mesosphere.com/2014/12/03/docker-on-mesos-with-chronos/)

------
perlgeek
Yes, we need new, standardized APIs. However, this made me cringe:

> Exposing machines as the abstraction to developers unnecessarily complicates
> the engineering, causing developers to build software constrained by
> machine-specific characteristics, like IP addresses and local storage.

It reminded me of the old RPC approach of making potentially any method call a
remote method call. It didn't work out, simply because remote call latencies
are orders of magnitude higher than those of in-process calls, so you need
different (coarser) APIs for them.

By the same token, while abstractions are very welcome, they should still
allow distinctions between local and non-local resources, for various
definitions of "local".

~~~
dllthomas
While I take your point, "local" is a leaky concept inside a data-center. On a
fast network, "in cache on the neighboring computer" can be closer than "on my
hard drive" for most purposes.

It seems like the ideal would be a specification language that doesn't care
where things are, an implementation that tries to deal with that
automagically, and a way to precisely specify portions of it (up to all of
it) that is checked against the high-level specification.

~~~
thrownaway2424
Absolutely. In-core on a neighboring machine can be accessed in less than 1ms.
On-disk on the local machine (or any machine) can take tens of seconds.

~~~
chetanahuja
_" On-disk on the local machine (or any machine) can take tens of seconds"_

Surely you meant 10's of _milli_ seconds. But even so, fastest SSD random
access latencies are in sub-millisecond ranges.

~~~
thrownaway2424
No I did not mean that. Contended local disk access under heavy random
read/write workloads takes essentially forever.

------
zobzu
This stuff is very confused, and it confuses all the HN commenters even more.

There is no such thing as a datacenter OS.

Simplification: an OS is a kernel and its associated base software.

The kernel drives the hardware. You need to talk to disks, memory, whatnot. A
kernel is needed. You need to talk to the kernel and tie these components
together. You write software. Boom, you have an OS.

A datacenter isn't a disk and memory and devices. It's a bunch of computers
which themselves drive those devices.

If you invented a "datacenter" with a bunch of disks and an API to drive them,
then a bunch of CPUs and an API to drive them, and so on, you'd end up with a
supercomputer and a single, unreliable, unsafe OS (which is exactly why nobody
makes supercomputers anymore. They make clusters of computers. Cheaper, more
reliable. Clusters. I.e... datacenters).

The only thing that could be needed is a universal API for resource access.
Need a DB? Here's an API. Need disk space? Here's an API. And so on.

This is exactly what AWS is and does. S3 doesn't expose the OS. It's just a
filesystem API. ELBs aren't an OS. They're a load balancer API.

It turns out that below all that there's an actual traditional OS, because
that's the way it works reliably.

There are smarter ways to interconnect these systems (see Plan 9), but it's
always running a "regular" OS in the end, too.

~~~
presspot
Is a 40,000 core disaggregated rack really that different from a 4 core
laptop? Google pioneered this way of thinking [1]. tl;dr the datacenter is a
computer and that computer will inevitably have an OS.

[1] [http://www.cs.berkeley.edu/~rxin/db-papers/WarehouseScaleComputing.pdf](http://www.cs.berkeley.edu/~rxin/db-papers/WarehouseScaleComputing.pdf)

~~~
zobzu
Indeed, i.e. the datacenter doesn't expose an OS but a set of APIs. There are
enough differences between that and a full OS (which also provides its own
API) to call BS, IMO.

------
discodave
One way to think about this problem is to look at components that people
thought were useful in an operating system and then think about what a
distributed version might look like.

E.g. for an OS we have:

filesystem

scheduler

cron

Then you can go look at the history of these primitives so the same lessons
don't have to be relearned. For example, the Linux kernel has gone through
many iterations of its scheduler (cgroups, CFS, etc.). Why did they do that?
Why were previous incarnations not good enough? And so on.

~~~
presspot
That is _exactly_ right.

------
preillyme
While not a complete data center operating system, Mesos, along with some of
the distributed applications running on top, provide some of the essential
building blocks from which a full data center operating system can be built:
the kernel (Mesos), a distributed init.d (Marathon/Aurora), cron (Chronos),
and more.

Interested in learning more about or contributing to Mesos? Check out
mesos.apache.org and follow @ApacheMesos on Twitter. We’re a growing community
with users at companies like Twitter, Airbnb, Hubspot, OpenTable, eBay/Paypal,
Netflix, Groupon, and more.

------
stryan
This sounds like the problem Plan 9 set out to solve.

~~~
ferrari8608
I was hoping I would see someone mention Plan 9 here. Solving that problem
does indeed seem like what their goal was.

"The early catch phrase was to build a UNIX out of a lot of little systems,
not a system out of a lot of little UNIXes."

[http://plan9.bell-labs.com/sys/doc/9.html](http://plan9.bell-labs.com/sys/doc/9.html)

------
sea6ear
When I think of abstractions above the level of individual machines, I pretty
much automatically go to Erlang as the only language I know of that operates
at the right level of abstraction for that.

It's the only (semi-mainstream?) language I know of that builds the
infrastructure into the language itself to make the individual underlying
machines (OS / hardware) appear irrelevant and to let programs seamlessly
span a group of computers.

------
23david
I definitely wouldn't consider Mesos to be anything like an operating system
for the datacenter. That's just marketing language and confuses things.

Mesos is basically an application scheduler. It doesn't manage the base
operating systems or machine provisioning. Mesos is concerned with ensuring
that one or multiple applications are launched and running on a cluster of
machines.

The Saltstack framework is the only thing I know of that would provide all the
primitives and control necessary for an operator to truly control a full
5-300,000 node datacenter from one station. It's the only thing out there that
will allow for super low latency response to commands across the cluster. It
also can do configuration management etc if you want, or a lot of people use
it to just trigger existing chef/puppet jobs.

I've been using Saltstack in conjunction with Mesos to build out this full
datacenter-as-OS stack. Works... :-)

~~~
gtaylor
> It's the only thing out there that will allow for super low latency response
> to commands across the cluster. It also can do configuration management etc
> if you want, or a lot of people use it to just trigger existing chef/puppet
> jobs.

FWIW, Ansible also does low-latency communications through
pipelining/accelerated mode. Ditto for the config management and triggering.
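For what it's worth, SSH pipelining in Ansible is a one-line switch in ansible.cfg (it cuts the number of SSH operations per task; accelerated mode was a separate transport):

```ini
[ssh_connection]
pipelining = True
```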

~~~
res0nat0r
I don't believe that even with Ansible's accelerated mode it can come anywhere
close to Salt for large deployments. It is still agentless, has to run from a
single source, and uses SSH, which is going to be slower than ZeroMQ. Also,
using something like salt-syndic to create topologies can reliably scale to
thousands of machines and still cope with the load.

------
cdnsteve
A singular OS that acts on well-defined abstractions and interfaces, making
the number of resources the massive machine uses transparent. Software won't
be the only thing changing to support this shift in large-scale computing,
but hardware as well. Would this type of machine be autonomous, with little
user management? AI could assist the machine in optimizing its configuration.
Pretty soon it could interface with drones that repair hardware and even add
new nodes. A zero-configuration environment: the operator simply tells it
what it should be running, until Skynet is unleashed.

------
glibgil
The thing I like about Mesos is that when you approach your problem, you are
planning for high availability from the beginning. You don't get that with
Docker. Docker lets you make your provisioning stable and repeatable, but you
don't get any advantage on HA; you are on your own there. Mesos, though,
guides the whole architecture of your app towards HA, which is nice.

~~~
preillyme
Docker aids software deployment through the automation of Linux Containers
(LXC). While Docker has made the lives of developers and operators easier,
Docker plus Mesosphere provides an easy way to automate and scale deployment
of containers in a production environment.

------
joshbaptiste
Combined with rump kernels, I can see this as the data center of the
future... simply a micro-kernel running a specific application without the
bloat of a full OS, while an orchestration unit/OS such as Mesos does all the
setup and teardown of such services, providing resources on demand... lovely.

------
ldng
Mesos is just one piece of the puzzle. I think the Hadoop project proper will
be the first to get there, maybe v3 or v4. That said, given the diversity
under the Hadoop umbrella, I expect distributions to appear; like RHEL is a
GNU/Linux OS, there might be a (Cloudera?) Hadoop OS.

~~~
SEJeff
I wouldn't be betting my marbles on Hadoop, or anything MapReduce, nowadays.
Spark beats it pretty handily for data processing, although HBase isn't too
bad.

~~~
sciurus
With YARN, Hadoop has built a more generic scheduler comparable to Mesos.

[https://www.quora.com/How-does-YARN-compare-to-Mesos](https://www.quora.com/How-does-YARN-compare-to-Mesos)

~~~
SEJeff
Not sure about that (I haven't used YARN, though). YARN seems extremely
Hadoop-specific, whereas Mesos is a "closer to the metal" generic framework
for building frameworks that schedule things.

If I were to compare the two, I'd call YARN an application scheduler and
Mesos more of a meta-scheduler. You could build YARN on top of Mesos; you
likely couldn't easily build Mesos on top of YARN. It should be relatively
trivial to run YARN as a Mesos framework, which someone has already done:

[https://github.com/mesos/myriad](https://github.com/mesos/myriad)

~~~
ldng
True, but Hadoop is the whole tooling, and with YARN they are distancing
themselves from Map/Reduce. IMHO, it looks like they are heading in the right
direction but are not quite there. That's why it may be v4 before we can
think of Hadoop as an OS for the datacenter.

Mesos is closer to the metal but is still only a piece of what is needed.
Picturing it as an OS is misleading, while picturing Hadoop v2 as a proto-OS
is not that inaccurate.

------
superbaconman
I feel like this is on the right track, but sadly fails to address the
underlying problem. Until the network is virtualized everything else will
suffer. If the cloud is to become the default platform for services (I think
it will), then the user must be able to define their own virtual data center.
This includes their own virtual nets. What the author writes about will be
built upon these virtual nets, but until the network itself is able to be
programmed we'll likely be drifting between optimal states of computing.

~~~
presspot
The DCOS supports network virtualization. Changes in underlying infrastructure
can be manifested all the way up the stack to applications and their
schedulers. E.g., the DCOS can rewrite network routing tables based on
placement decisions.

------
nqnielsen
So true; think about what multi-tasking and virtual memory have done for
developer convenience _and_ system performance over the last decades. I bet
we are going to see the same effects in this space too; there are so many
interesting things to do in scheduling! Especially when you can distinguish
between clever placement of long-lived tasks and fitting in short-lived
low-latency jobs.

On another note, next up: "data center_s_ need an operating system". Very
interested to see the move towards multiple DCs/availability zones.

------
presspot
The datacenter == the new form factor

~~~
joefarish
What makes you say that?

~~~
presspot
Datacenter and back-end apps simply don't fit on a single machine anymore.
Every app of reasonable scale is probably a distributed system of some sort.
That, and there is a new class of "apps" (or, more precisely, datacenter
services) that were built to operate across fleets of machines from day one,
such as Spark, Hadoop, Cassandra, Kafka, Elasticsearch, and so on.

~~~
toast0
On the other hand, a single machine can get pretty big today. For example,
[http://www.supermicro.com/products/system/2U/6028/SYS-6028U-TR4T_.cfm](http://www.supermicro.com/products/system/2U/6028/SYS-6028U-TR4T_.cfm)
is 2U, has 12 drive bays, and can have 1.5 TB of RAM. You could do 24 bays if
you are using 2.5" drives (most SSDs). A lot of complexity can be avoided by
getting fewer, bigger machines (this doesn't help much if you need massive
CPU, though, but Intel's architectures are getting better every cycle).

~~~
tacticus
The problem with this is that singular systems require a lot of engineering
effort and produce interesting side effects when you attempt to make them
reliable.

Any singular system is going to fail.

~~~
toast0
Ok, get two huge systems instead of a datacenter of small systems. You still
have made great strides in reducing complexity.

~~~
tacticus
And sharing state between them? Or ensuring that only one is doing things
when only one should be doing things? (STONITH needs to be implemented
somehow.)

And the performance penalties you enjoy if you go down the shared-disk path
(which is still going to be a fun failure when it does fail).

------
rakoo
This screams Erlang all over the place: "portable", a common API everywhere,
automatic aggregation of resources, processes spawning wherever there is a
CPU a little bit too idle, process isolation...

I'm no Erlang expert, but it seems to me all the fundamental bricks are here
(and have been for multiple decades!) to build this "datacenter platform".

------
nwilkens
In 2005 I saw a demo of this product from VirtualIron called VFe, which did
this on a smaller scale (maybe 10 servers) -- it looked amazing.

[https://www.kernel.org/doc/ols/2005/ols2005v2-pages-243-258....](https://www.kernel.org/doc/ols/2005/ols2005v2-pages-243-258.pdf)

~~~
wmf
Mosix, Virtual Iron, vNUMA, ScaleMP, etc. are variants of the RPC fallacy:
making remote/slow/independent things pretend to be local/fast/fate-shared.
[http://www.eecs.harvard.edu/~waldo/Readings/waldo-94.pdf](http://www.eecs.harvard.edu/~waldo/Readings/waldo-94.pdf)
Successful distributed software tends to go the opposite direction, making
asynchrony and unreliability the base case.

------
frozenport
This article feels out of touch to somebody who frequently types _mpirun_ on
a login node. It seems like this level of abstraction is already achieved by
MPI and the task schedulers that come with it.

~~~
presspot
MPI will run on top of the DCOS. That is the point. It supports a multitude of
scheduling models all running multi-tenant.

------
preillyme
Given that VMS futures, POSIX support, RMS features and programming, server
and application scaling, and VM sprawl have all been discussed recently, this
is really something else to ponder.

------
digitalzombie
Mesos!

If you want containers, use Kubernetes by Google!

I was in the process of putting Cassandra on Mesos.

If you want to mess around with it, you just need a DigitalOcean account;
head over to Mesosphere and they'll set up 5-, 7-, or 10-node clusters for
you.

------
timchen1
Isn't this what all the hype behind CoreOS is about?

~~~
preillyme
Well the main idea behind Mesos is to create an aggregation layer that spans
server capacity in local and remote datacenters. This aggregation does not
tightly bind the servers into a shared memory system, but rather treats them
like a giant resource pool onto which applications can be deployed. It is a
bit like the inverse of server virtualization really.

