

Data Center Servers Suck, But Nobody Knows How Much - nightbrawler
http://www.wired.com/wiredenterprise/2012/10/data-center-servers/

======
jws
Just an echo of the NYTimes non-story.

Bizarrely, this one repudiates its own title near the end, where they
contacted a person who knows what they are doing…

 _Over at Mozilla, Datacenter Operations Manager Derek Moore says he probably
averages around 6 to 10 percent CPU utilization from his server processors,
but he doesn’t see that as a problem because he cares about memory and
networking._

 _…_

 _After we contacted him, Moore took a look at the utilization rates of about
1,000 Mozilla servers. Here’s what he found: the average CPU utilization rate
was 6 percent; memory utilization was 80 percent; network I/O utilization was
42 percent._
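
For what it's worth, gathering numbers like these isn't hard. A minimal
sketch, assuming a Linux box with the third-party psutil package (the
1 Gbit/s link speed is just a guess; the OS doesn't know what counts as 100%
of your network):

    # Sketch only: sample the same three metrics on one Linux box. Assumes the
    # third-party psutil package; the 1 Gbit/s link speed is an assumption.
    import time
    import psutil
    LINK_BYTES_PER_SEC = 1e9 / 8                  # assume a 1 Gbit/s NIC
    cpu = psutil.cpu_percent(interval=5)          # average CPU % over 5 seconds
    mem = psutil.virtual_memory().percent         # RAM in use, as a percentage
    before = psutil.net_io_counters()
    time.sleep(5)
    after = psutil.net_io_counters()
    rate = (after.bytes_sent + after.bytes_recv
            - before.bytes_sent - before.bytes_recv) / 5.0
    net = 100.0 * rate / LINK_BYTES_PER_SEC
    print("cpu %.0f%%  mem %.0f%%  net %.0f%%" % (cpu, mem, net))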

CPU use is irrelevant to most internet servers.

Over the week, I operate my car engine at about 1.2% capacity. Maybe they
should write about that.

~~~
flatline3
I can't say I understand that. If RAM utilization so vastly outscales CPU
utilization, there's a significant resource-use inefficiency at play.

Machines consume a baseline amount of power whether they're used or not; that
power usage obviously increases with utilization, but ideally you'd have full
utilization across the board.
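
As a rough illustration of why that matters (the watt figures below are made
up, and the linear power model is an assumption):

    # Toy numbers: a server that idles at 150 W and draws 250 W flat out.
    IDLE_W, MAX_W = 150.0, 250.0
    def watts(utilization):
        # crude linear power model; an assumption, not a measurement
        return IDLE_W + (MAX_W - IDLE_W) * utilization
    for u in (0.06, 0.50, 0.90):
        print("util %3.0f%%  power %.0f W  work per watt %.5f"
              % (u * 100, watts(u), u / watts(u)))
    # The 6%-utilized box pays nearly all of its idle power for very little
    # work, so its work-per-watt comes out roughly an order of magnitude worse
    # than the 90%-utilized one.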

If memory usage is so much higher than CPU usage, I have to wonder what it is
that Mozilla is doing wrong with their architecture. Are they using pre-fork-
style servers? Are they just provisioning poorly? What is it?

> _CPU use is irrelevant to most internet servers._

Why? The CPU is used when the machine _does_ anything. Ideally you're
operating the machines at full capacity, minus some headroom to handle load
spikes.

> _Over the week, I operate my car engine at about 1.2% capacity. Maybe they
> should write about that._

What you're doing is inefficient, and they do write about that. The solution
is called car sharing and public transportation.

~~~
cube13
>If memory usage is so much higher than CPU usage, I have to wonder what it is
that Mozilla is doing wrong with their architecture. Are they using pre-fork-
style servers? Are they just provisioning poorly? What is it?

Or are they just serving up web pages to users? That's RAM and bandwidth
heavy, but very CPU light. You still need the machines to scale your load, but
you're not going to be using the CPU.

Realistically, for just about any application, you're going to be RAM-bound
before you're CPU-bound. The exceptions (off the top of my head) are scientific
computing and video rendering, both of which are CPU heavy and very
deterministic in their behavior, which allows for heavy optimization to
minimize L2 and L3 cache misses.

~~~
flatline3
> _Or are they just serving up web pages to users? That's RAM and bandwidth
> heavy, but very CPU light. You still need the machines to scale your load,
> but you're not going to be using the CPU._

That depends very much on the efficiency of your software architecture. A
well-architected web app can scale RAM and CPU utilization up much more in
step than something modeled on independent, shared-nothing processes.

Additionally, even if scaling RAM before CPU is the only possible model, that
doesn't make the utilization efficient; it just implies that higher efficiency
could still be reached by putting more RAM in each machine.
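
Back-of-the-envelope version of the pre-fork point (all sizes below are
assumptions, not measurements):

    # Rough arithmetic only.
    workers       = 16     # pre-fork worker processes on one box
    per_worker_mb = 150    # interpreter + app code per worker
    hot_cache_mb  = 2000   # in-process cache of hot data
    prefork_mb = workers * (per_worker_mb + hot_cache_mb)  # each worker holds its own cache copy
    shared_mb  = per_worker_mb + hot_cache_mb              # one process, threads share one copy
    print(prefork_mb, shared_mb)   # 34400 MB vs. 2150 MB for the same working set

Copy-on-write after fork softens this in practice, but a cache that's
populated per worker still multiplies with the worker count.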

------
ChuckMcM
Sigh, I wonder why people write stories like this.

Data Center servers don't suck, and I'd bet that most folks running them
understand what their utilization is. Blekko has over 1500 servers in its
Santa Clara facility and we know pretty much exactly how utilized they are,
but that is because we designed the system that way.

It's funny how things have come full circle. Back in the '70s you might have a
big mainframe in the machine room; it was so expensive that the accounting
department required you to get maximum use out of the asset, so you had batch
jobs that ran 24/7. You charged people by the kilo-core-second, using economics
to maximize the asset's value.

Then minicomputers, and later large multi-CPU microcomputer servers (think Sun
E10000 or the IBM PowerPC series), replaced mainframes. They didn't cost as
much, so the pressure to 'get the return' was a bit lower: you could run them
at 50% utilization and they still cost less than you'd expect to pay for
equivalent mainframe power.

Then came the dot-com explosion, and suddenly folks were 'co-locating' a
server in a data center because it was cheaper to get decent bandwidth there
than to run it the last mile to where your business was. But you didn't need a
whole lot of space for a couple of servers, just a few 'U' (1.75" each of
vertical space) in a 19" rack. And gee, some folks said, why bring your own
server? We can take a machine and put a half dozen web sites on it, and then
you could pay something like 1/6th the cost of 4U of rack space in the colo
for your server. Life was good (as long as you weren't co-resident with a porn
or warez site :-)

Then, at the turn of the century, the Sandia 'cheap supercomputer' and NASA
Beowulf papers came out and everyone wanted to put a bunch of 'white box'
servers in racks to create their own 'Linux Cluster' and the era of 'grid'
computing was born.

The interesting thing about 'grid' computing, though, was that you could buy
128 generic machines for about $200K which would outperform a $1.2M big-box
server. The accountants were writing these things off over 3 years, so the
big-box server cost the company $400K/year in depreciation, the server farm
maybe $70K/year (if you include switches and such), so it really didn't matter
to the accountants if the server farm was less 'utilized': the dollars were so
much lower and the compute needs were met.

Now that brings us up to the near-present. These 'server farms' provided
compute at hitherto unheard-of low costs, and access to the web became much
more ubiquitous. That set up the situation where, even if your service only
earned a few dollars per 1,000 requests to this farm, you made it up in
volume, like a real farm harvesting corn. Drive web traffic to this array of
machines (which have a fixed cost to operate) and turn electrons into gold. If
you can get above about $5 revenue per thousand queries ($5 RPM), you can
pretty much profitably run your business from any modern data center.

But what if you can't get $5 RPM? Or your traffic is diurnal and you get $5
RPM during the day but $0.13 RPM at night? Then your calculation gets more
complex. And of course, what if you have 300 servers, 150 of which are serving
and 150 of which are 'development'? Then you have to cover the cost of all
of them from the revenue generated by the 'serving' ones.
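
A toy version of that arithmetic (every number below is made up):

    # Toy revenue model; every figure here is an assumption.
    servers            = 300       # 150 serving + 150 development
    cost_per_server_yr = 3000.0    # $/server/year all-in: depreciation, power, space
    queries_per_day    = 500000    # traffic hitting the serving half of the fleet
    day_share, night_share = 0.7, 0.3      # fraction of traffic by time of day
    day_rpm, night_rpm     = 5.00, 0.13    # $ per 1,000 queries
    blended_rpm = day_share * day_rpm + night_share * night_rpm
    revenue_yr  = queries_per_day * 365 / 1000.0 * blended_rpm
    cost_yr     = servers * cost_per_server_yr  # the dev half has to be paid for too
    print("blended RPM $%.2f  revenue $%.0f/yr  cost $%.0f/yr"
          % (blended_rpm, revenue_yr, cost_yr))

With those made-up numbers, traffic that roughly breaks even at a flat $5 RPM
falls well short once you blend in the overnight rate and carry the
development half of the fleet.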

Once you start getting into the "business" of web infrastructure it gets a
bit more complicated (well, there are more things to consider; the math is
pretty much basic arithmetic). And 'efficiency' suddenly becomes something you
can put a price on.

Once you get to that point, you can point at utilization and say 'those 3
hours of high utilization made me $X', and suddenly the accountants are
interested again. Companies like Google, whose business is information
'crops', were way ahead of others in computing these numbers; Amazon too,
because they sell and price this stuff with EC2, AWS, and S3, and they need to
know which prices are 'good' and which are 'bad.' It is 'new' to older
businesses that have yet to switch over to this model. And that is where a lot
of folks are making their money (you pay one price for the 'cloud' which is
cheaper than you had been paying, so you don't analyze what it would cost to
do your own 'cloud'-type deployment). That will go away (probably 5 to 10
years from now) as folks use savings in that infrastructure to be more
competitive.

------
jeremyjh
Conflating utilization with efficiency is a mistake made so often and so
reliably that it alone probably explains why reporting on this is so sparse.

To understand if a system is inefficient, I would have to know if the same
computing could be performed at a lower total cost of ownership. The
utilization of the actual specific resources is an input, but only to
understand the computing demand. The reason it is so common to "throw servers
at it" is because the hardware and even the power is very cheap compared to
human labor costs; and there are significant labor efficiencies in deploying
infrastructure in batches.

------
gphil
"After we contacted him, Moore took a look at the utilization rates of about
1,000 Mozilla servers. Here’s what he found: the average CPU utilization rate
was 6 percent; memory utilization was 80 percent; network I/O utilization was
42 percent."

Sounds like they are stuffing as much data as possible into RAM for
performance reasons, which causes the servers to be memory-bound, and as such
they have more CPU available than they need relative to the amount of RAM.
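
A minimal sketch of the pattern (hypothetical names; the point is just that
the hot data lives in RAM, so each request costs almost no CPU):

    # Hypothetical read-mostly service: keep the working set in RAM so a
    # request is a dict lookup (memory-heavy, CPU-trivial) rather than a disk
    # read or a recomputation.
    hot_data = {}  # fills most of the box's RAM once loaded
    def load_working_set(rows):
        # done once at startup; this is where the 80% memory utilization comes from
        for key, value in rows:
            hot_data[key] = value
    def handle_request(key):
        # per-request cost is a hash lookup: microseconds of CPU per query
        return hot_data.get(key)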

I am (and I think a lot of other people are) in the same situation.

------
tluyben2
This is a political issue, as the article already indicates: even though you
could buy very energy-efficient but worse-performing ARM-based servers with
(depending on your needs) a networking/memory/IO focus, it is easier to take
the 'nobody ever got fired for buying IBM' route. If something goes wrong,
it's easier to say that you bought the fastest thing money can buy (who cares
if it's running at 6% CPU utilization) than to have gone the 'experimental'
route with low-energy server equipment and risk downtime that could be blamed
on your decision.

The problem, of course, is that below the enterprises it's probably even
worse, where high-powered servers with cPanel running at most 400 accounts per
server could be put on a microcontroller costing a few cents on average. I ran
servers which had 300-400 accounts on them and where the TOTAL number of
requests across all the sites was a few hundred _per_ _month_. I think this is
the case for most of the hundreds of thousands of servers running at
HostGator, GoDaddy, etc. The whole problem is 'peaks': management will ask
you, 'nice story, but what about peaks?' So you just buy the fastest thing
there is, slap VMware on it and hope for the best. I think power management
like in mobile phones would not be bad for this purpose; most of the time you
switch off all cores but one and drop the clock speed, and when a peak comes,
you power everything back up.
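
A toy version of that idea on Linux (the sysfs paths are the standard ones,
but the threshold and interval are arbitrary, it needs root, and real power
management belongs in the kernel governor or a tool like cpupower, not a
script like this):

    # Toy sketch only: offline all cores but cpu0 when load is low, bring them
    # back when it rises. Needs root; threshold and interval are arbitrary.
    import glob, time
    def one_minute_load():
        with open("/proc/loadavg") as f:
            return float(f.read().split()[0])
    # cpu0 usually can't be offlined, so only touch cpu1 and up
    switches = sorted(glob.glob("/sys/devices/system/cpu/cpu[1-9]*/online"))
    while True:
        busy = one_minute_load() > 0.5          # crude, arbitrary threshold
        for path in switches:
            with open(path, "w") as f:
                f.write("1" if busy else "0")   # 1 = online, 0 = powered down
        time.sleep(30)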

~~~
travem
At the enterprise datacenter level it is not uncommon for sysadmins to use
VMware's Distributed Resource Scheduler (DRS) coupled with Distributed Power
Management (DPM) to balance workloads and power down underutilized hosts.
There are probably similar solutions from other vendors. (Disclosure: VMware
employee.)

------
JimmaDaRustla
I'm sure this has been said many times, but "utilization" is not the same as
needing that capacity to be available: we could run some servers on a 386 and
achieve 90% utilization, but then it takes two minutes to view a web page!

Also, CPUs are known to "step down" their clock speed to be power efficient
when not under load; this is not reflected in a utilization percentage.

Also, like someone else mentioned, a bigger cost is cooling. Personally, I
can't wait to see stuff like this utilized in the industry:
[http://techcrunch.com/2012/06/26/this-fanless-heatsink-is-the-next-generation-in-cpu-cooling/](http://techcrunch.com/2012/06/26/this-fanless-heatsink-is-the-next-generation-in-cpu-cooling/)

------
alexchamberlain
Let us not forget that running servers at 100% CPU/Memory/Network IO
(whatever) is bad too. Sudden spikes or a machine crashing will kill your
service.

~~~
chubot
Sure, but they said 80% RAM and 6% CPU utilization. By both your logic and
theirs, it's better to run at 50% RAM and 50% CPU.

~~~
alexchamberlain
Yeah, totally agree.

------
staunch
CPU performance is simply far beyond what we need in most cases.

What's the average CPU utilization of desktops/laptops across the world? I
wouldn't be surprised if it was even lower.

My laptop:

    Cpu(s):  1.4%us,  0.1%sy,  0.0%ni, 98.4%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st

~~~
chubot
See my comment above -- this is true now, but I believe it's because our
software and programming methodology isn't flexible enough to use resources
efficiently. Whenever you write code you're essentially hard-coding the
balance between CPU and memory. It doesn't cause much of an issue on desktop
machines, but with the growing number of data centers, it will start to be
economical to have more flexibility in our code.
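
One small example of what I mean by hard-coding the balance (hypothetical
function; the cache size is the knob you'd want exposed):

    from functools import lru_cache
    # The cache size is the CPU/memory dial: bigger spends RAM to save CPU,
    # maxsize=0 recomputes everything and spends CPU to save RAM. Today that
    # choice is usually baked into the code rather than tuned per machine.
    CACHE_ENTRIES = 100000  # hypothetical knob
    @lru_cache(maxsize=CACHE_ENTRIES)
    def rank_score(item_id):
        # stand-in for some expensive computation
        return sum(i * i for i in range(item_id % 1000))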

------
greenyoda
Data centers have to be designed to handle peak capacity, not average
capacity. For example, if a retail business's web site couldn't handle the
peak loads in the month before Christmas without degrading response time,
they'd go out of business.

------
JakeFratelli
The real issue is the lack of testing on the triple-redundant backup power
systems. Utilization is meaningless on a server that isn't turned on.

------
ollybee
Hosting companies are in the business of selling servers. In my experience,
most customers' dedicated servers are hugely underutilised, but customers only
get the higher level of support or service if they take dedicated tin.

