
The sorry state of server utilization and the post-hypervisor era (2013) - vimes656
https://gigaom.com/2013/11/30/the-sorry-state-of-server-utilization-and-the-impending-post-hypervisor-era/
======
nickpsecurity
The article is looking at it all wrong. To solve a problem, start by looking
at those who already solved it, then see if you can apply that. Mainframes
have long had ridiculously high utilization and throughput. The secret is
their I/O architecture: computing happens on compute nodes and I/O is managed
by I/O processors, both of which are well integrated. If Intel etc. copied
this, they'd get much higher utilization and throughput. Smart embedded
engineers do the same thing, albeit with microcontrollers.

[https://en.wikipedia.org/wiki/I/O_channel](https://en.wikipedia.org/wiki/I/O_channel)
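
As a loose software analogy (not the actual channel architecture; the
filenames and numbers here are made up), the idea is that compute never stalls
on devices because a dedicated worker owns the I/O:

    # Loose software analogy to channel I/O: the "compute node" (main thread)
    # never blocks on devices; a dedicated worker services all I/O requests.
    import queue
    import threading

    io_requests = queue.Queue()

    def io_processor():
        """Dedicated worker that owns the (slow) device I/O."""
        while True:
            item = io_requests.get()
            if item is None:                      # shutdown signal
                break
            filename, data = item
            with open(filename, "a") as f:        # the actual blocking I/O
                f.write(data + "\n")

    worker = threading.Thread(target=io_processor)
    worker.start()

    for i in range(1000):
        result = i * i                            # compute keeps running
        io_requests.put(("results.log", str(result)))  # hand off to I/O worker

    io_requests.put(None)
    worker.join()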

------
blincoln
It's 2015, not 1985. Most people are not paying IBM for every CPU cycle (used
or not) on a mainframe. Should IT staff buy lower-end hardware to look "good"
on a "CPU utilization" report that belongs in a history book, or should they
spend a tiny amount of extra money to ensure that customers get good
performance during peak periods?

------
bhousel
These aren't really startling findings. Most apps in the enterprise require
separate instances for development, staging, production, and a hot standby for
business continuity. And you need each of those environments for multiple
tiers (db, app server, etc). And you need the entire stack replicated to each
local datacenter because of latency (so the idea of having the APAC users use
the database at night and the NAM users use it during the day just doesn't
work in practice). So a typical business app can easily require >10 server
instances, most of which will sit idle most of the time.

~~~
parasubvert
It also reflects a very stubborn unwillingness to actually use virtualization,
i.e. to collect capacity-optimization metrics and let them drive the placement
of VMs with appropriate over-provisioning.

Over-provisioning of RAM is dicey, and I/O-aware placement is still a black
art, but CPU is a no-brainer. I routinely find places that refuse to run
anything but 1:1 vCPU-to-physical-core ratios, or even to enable VMware
DRS/HA. Mainly because they bought virtualization for convenience but then
didn't update their capacity and ITIL processes from the 90s, where assets are
pegged to a physical CPU for "regulatory" reasons and capacity is still
fear-driven rather than data-driven. Or, also common, vendors of packaged or
platform software ... and bad dev teams ... love to blame virtualization for
performance problems rather than actually analyze and fix the problem. So
over-provisioning becomes a political decision made by managers rather than a
technical one made by the ops staff.
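
As a minimal sketch of what "let the metrics drive placement" can look like
(the VM names, utilization numbers, and targets below are invented, not from
any real environment):

    # Back-of-the-envelope vCPU overcommit sizing from measured utilization.
    # All numbers are illustrative.
    vm_avg_cpu = {            # fraction of one core each VM actually uses
        "app-01": 0.12,
        "app-02": 0.08,
        "db-01": 0.45,
        "batch-01": 0.30,
    }

    physical_cores = 16
    target_core_util = 0.60   # leave headroom for peaks

    demand = sum(vm_avg_cpu.values())            # measured steady-state demand
    budget = physical_cores * target_core_util   # cores we're willing to commit

    print(f"measured demand: {demand:.2f} cores")
    print(f"budget at {target_core_util:.0%} target: {budget:.1f} cores")
    print(f"this load could stack roughly {budget / demand:.1f}x before hitting the target")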

I also don't see many places just allowing "shutdown/archival" of dev/test
environments that the metrics show are clearly not being used, or even having
a process that tells the ops team to press a button when project funding
ceases. It's obvious and simple, but politically it is "risky" because some
VP's pet project having resources reclaimed makes them feel weak or something.
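
That "button" doesn't need to be clever. A sketch, assuming you can pull a
last-activity timestamp per VM from whatever monitoring you already run (the
inventory and threshold here are hypothetical):

    # Flag dev/test VMs idle for more than N days as archival candidates.
    # Inventory and threshold are hypothetical placeholders.
    from datetime import datetime, timedelta

    IDLE_THRESHOLD = timedelta(days=30)
    now = datetime(2015, 6, 1)

    inventory = [
        {"name": "proj-a-dev-01", "env": "dev", "last_activity": datetime(2015, 2, 10)},
        {"name": "proj-b-test-02", "env": "test", "last_activity": datetime(2015, 5, 28)},
        {"name": "billing-prod-01", "env": "prod", "last_activity": datetime(2015, 5, 30)},
    ]

    candidates = [
        vm for vm in inventory
        if vm["env"] in ("dev", "test") and now - vm["last_activity"] > IDLE_THRESHOLD
    ]

    for vm in candidates:
        print(f"archival candidate: {vm['name']} (idle since {vm['last_activity']:%Y-%m-%d})")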

Then I find the occasional data center running heavily over-provisioned at
high (60%+) utilization, and life is fine, but for some reason these surveys
never make it to those places. So the laggards never really find out that it's
"ok" to stack VMs.

Now with container clusters like Mesos/Marathon, Lattice/CF, and Kubernetes,
we are going to see some interesting behavior. A lot of companies are very
uncomfortable with the whole "you don't really know/care which physical
machine gets a container instance; the cluster is fair-share scheduled as a
whole". It forces them again to admit their supporting processes are
antiquated.

------
lsc
"A post-hypervisor world "

lol. I've been predicting a backlash against this virtualization hype since
2005, and this is the first time I've heard anyone else mention anything like
it.

Of course, if you had told me in 2005 that we would be switching from
hypervisors back to containers, I would have broken down crying.

Is our industry run by masochists? Or just the inexperienced, who don't know
any better?

~~~
parasubvert
This was written by a VC hoping that Docker is going to be worth more than
VMware. I suspect he may be disappointed.

~~~
lsc
A lot of the value of VMware is in the sales channels. Why use VMware rather
than QEMU/KVM? It used to be that VMware came with support. But now that KVM
is owned by Red Hat, which in my experience gives way better than average
support? Yeah.

But, yeah. Docker doesn't solve the "take this ancient rack of failing servers
and consolidate them down to one server... without updating the software" use
case that VMware is so often used for.

~~~
parasubvert
"Why use VMware?" is indeed the existential question facing them. For now,
it's because many IT shops can't wrap their heads around the alternatives or
justify the switch. Lack of skills (lots of Windows-centric shops), deep love
for DRS/vMotion/HA, deep support for Fibre Channel setups, etc. Btw, this is
arguably why VMware announced Photon recently: to go after Red Hat and eat at
their Linux monopoly.

That said, VMware did basically invent x86 virtualization as we know it today,
and that's justified the many billions in wealth it has generated to date.
Docker is (so far) a registry and a CLI wrapper around a Linux kernel feature.
It can and will be more, but it's not clear what.

------
justincormack
If the issue is running out of memory before running out of CPU time, then
containers won't help much, except to the extent that memory is over-allocated
to VMs in static amounts. The solution is either larger-memory systems, which
are now much more widely available than when this article was written, or
applications that use less memory.
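
To make the "static amounts" point concrete, a toy calculation (all numbers
invented) of how much memory fixed VM sizes can strand relative to what the
guests actually use:

    # Toy numbers: memory stranded by static VM allocations on one host.
    host_ram_gb = 256

    vms = [
        {"name": "web-01", "allocated_gb": 32, "used_gb": 9},
        {"name": "web-02", "allocated_gb": 32, "used_gb": 11},
        {"name": "db-01", "allocated_gb": 64, "used_gb": 41},
        {"name": "etl-01", "allocated_gb": 48, "used_gb": 6},
    ]

    allocated = sum(v["allocated_gb"] for v in vms)
    used = sum(v["used_gb"] for v in vms)

    print(f"allocated: {allocated} GB of {host_ram_gb} GB host RAM")
    print(f"actually used: {used} GB ({used / allocated:.0%} of allocation)")
    print(f"stranded by static sizing: {allocated - used} GB")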

~~~
vimes656
Is hypervisor memory ballooning widespread in major cloud providers these
days? How does it compare to bare-metal kernel memory allocation?

~~~
justincormack
No, it is not widespread. Under-provisioning is a bit of a dirty word too - it
breaks isolation.

The Google Borg paper says they use non-production batch jobs to soak up the
spare capacity, since those can be killed if necessary. Cloud providers could
offer this as a service in theory, although they are not really architected
that way.
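
A toy sketch of that Borg idea (the jobs, priorities, and capacities are
invented): low-priority batch work fills the slack and gets evicted the moment
production needs the machine:

    # Toy Borg-style placement: batch fills spare capacity, production evicts it.
    # All jobs, priorities, and capacities are invented for illustration.
    machine_capacity = 10.0   # abstract CPU units
    running = []              # list of (name, cpu, priority); higher priority wins

    def used():
        return sum(cpu for _, cpu, _ in running)

    def schedule(name, cpu, priority):
        # Evict lower-priority (batch) work until the new job fits.
        while used() + cpu > machine_capacity:
            victims = [job for job in running if job[2] < priority]
            if not victims:
                print(f"cannot place {name}")
                return
            victim = min(victims, key=lambda job: job[2])
            running.remove(victim)
            print(f"evicted {victim[0]} to make room")
        running.append((name, cpu, priority))
        print(f"placed {name}; machine at {used() / machine_capacity:.0%}")

    schedule("prod-frontend", 4.0, priority=100)
    schedule("batch-log-crunch", 5.0, priority=10)   # soaks up the spare capacity
    schedule("prod-pipeline", 4.0, priority=100)     # evicts the batch job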

------
mark_l_watson
Looking at this as an environmental problem makes some sense. I used to rent
cheap hosted servers and moved to virtualized systems like AWS, Azure, and
AppEngine partially because of environmental impact and partially out of
convenience.

We need staging servers and redundant backups, so getting really high
utilization is not possible, but I hope to see a lot of improvement.

The big companies seem to be doing things better. At Google, I had a bit of
angst running 10k-processor jobs, but they do use solar, set up data centers
near hydroelectric sources, etc. The same goes for Amazon, Microsoft, etc.

------
PaulHoule
Well, over-provisioning is good for perceived performance. Let corporate IT
increase efficiency in the same ham-handed way and you might find low latency
needs a new advocate.

------
LamaOfRuin
The idea that Google was industry leading on non-batch loads in 2013 seems
wrong to me. They were not selling those services then, so they did not have a
positive profit motive to optimize that usage (only a motivation to cut costs,
which I'm told is not nearly as effective). Amazon has had that motivation
(and necessity with their non-existent margins in every other part of their
business) for long enough to actually accomplish something.

~~~
parasubvert
At Google's scale, one doesn't need a lot of incentive to improve utilization.
Every IT shop has wanted the cost reduction of improved utilization since the
dawn of the PC era.

The difference is in process. Google's approach to workload placement is
automated by software, driven by engineering decisions and data.

Many IT shops' placement is political (new servers = new capital = power).

~~~
LamaOfRuin
At Google's scale you need much more incentive to get anything done. This is
even more true when it is something that will touch every division, product,
and service.

What every IT shop wants doesn't necessarily relate in any straightforward way
to what any IT shop invests resources in getting. Every IT shop prioritizes
many other things above utilization (and is right to do so).

All decisions, engineering or otherwise, are political. Different environments
involve different politics, but it's all still politics.

~~~
parasubvert
All decisions are political (i.e., power interests), but not all organizations
are configured to be primarily driven by power. This is especially true for
young organizations, or those that have gone through a cycle of renewal.

Google decided early on to drive towards an operational architecture that
allows individuals to act at scale on their infrastructure. A developer
deploys into production and it launches thousands of new containers and
disposes of thousands of old ones. A batch job runs, same thing. Deploying
services is uniform across the board. Thus, optimizing utilization through
improved container scheduling is something that the core site reliability
engineering team could do independently of individual services.

Google's early adoption of data-center-scale computing under Hölzle & team was
unique, along with Amazon's CEO-diktat move to a decentralized
service-oriented architecture, or Netflix's rewrite and move to the cloud.
Which is why you have articles like this, written by a VC, that want to
repackage this thinking and sell it back to old-school IT.

~~~
LamaOfRuin
> Thus, optimizing utilization through improved container scheduling is
> something that the core site reliability engineering team could do
> independently of individual services.

But is that something it is known they prioritized, or was there perhaps more
interest in optimizing the efficiency of deploying thousands of containers on
every deploy, across data centers, with reliable testing, without killing
in-flight processing, and scaling for subsecond response to bursty demand? Who
sets the priorities for what is most important, and how much of one are they
willing to sacrifice to improve physical utilization?

I have absolutely no doubt they had as many resources as any other company
dedicated to fine-tuning their data centers and related infrastructure. I
question whether they had the same motivation as a company like Amazon (which
was deriving direct profit from selling this resource) to prioritize
optimizing utilization.

------
cm2187
But is average utilisation the right metric? The work day is only 8-10 hours;
I would expect many corporate infrastructures to be active only during that
period. Plus, you don't size your infrastructure for a typical workload, you
size it to accommodate a higher-than-usual peak workload; otherwise you will
be down at the busiest period.
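
A quick back-of-the-envelope illustration (the numbers are made up): a box
sized sensibly for its daily peak can still look "wasted" on a 24-hour
average-utilization report:

    # Illustrative numbers only: peak-sized capacity vs. 24h average utilization.
    busy_hours = 9          # active business hours per day
    busy_avg_util = 0.45    # typical utilization during the work day
    idle_util = 0.05        # overnight background load
    # The busiest hour might hit 70-80%, which is what the box was sized for.

    avg = (busy_hours * busy_avg_util + (24 - busy_hours) * idle_util) / 24
    print(f"24-hour average utilization: {avg:.0%}")   # roughly 20%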

------
dsr_
Which of these hypothetical situations is more realistic?

CEO: "I see that we had 99.994% uptime for the last six months, and we came in
very close to the forecasted budget. Well done, engineers!"

CEO: "I see that we had 99.9% efficient usage for the last six months, and we
reduced our budget. Well done, engineers!"

Neither scenario is realistic, of course. Uptime is nice and efficiency is
nice and budgets are nice, but what the CEO is actually interested in is:

VP Customer Support: "Our satisfaction rate is up, call quality metrics are
great. I looked over the call stats and it looks like we're no longer getting
complaints about performance or unreachability."

~~~
parasubvert
I've worked with the executives of some large banks, telecoms, and
transportation companies. The CEO and board have generally only held the IT
team accountable to budgetary performance and risk (uptime, intrusion,
regulatory) metrics. The only IT impact on customer sat, by the traditional
view, is uptime.

One bank IT group I know reports on uptime relative to operating expense to
their business partners, prints the charts and graphs on plotter paper weekly,
and posts them in the cafeteria. Most of their bonus is directly tied to those
numbers. So: "cut costs and keep me up".

Delivery IT groups are very rarely measured by customer satisfaction; they're
measured by project and budget performance to baseline (on time, on budget,
etc). Customer sat is the responsibility of the business partners that drive
the requirements, programs, etc.

Is this effective? Not really. If they recognized Lean product development
principles, they'd incentivize everything by end-to-end cost of delay first
and risk reduction second.

