

Apache Benchmarks for Calxeda’s 5-Watt Web Server - mariuz
http://armservers.com/2012/06/18/apache-benchmarks-for-calxedas-5-watt-web-server/

======
reitzensteinm
There are lies, damned lies, and benchmarks.

The Xeon machine is running at under 15% utilization due to the gigabit
ethernet bottleneck, yet they're using the system TDP (that's the worst-case
thermal spec you design your cooling around).

If you put a 10 gigabit NIC in there, the advantage would be closer to 2x
performance/watt, since the Xeon would be processing 6.66x as many requests
(1/0.15). This is also a last-generation Xeon - Ivy will be out by the time
this CPU is.
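Rough sketch of that estimate; the request and TDP numbers below are placeholders in the spirit of the comment, not measurements from the post:

```python
# Rough version of the 10GbE thought experiment above. All numeric
# inputs are illustrative placeholders, not figures from the benchmark.
xeon_rps_at_1gbe = 6_950      # placeholder Xeon result at the 1GbE cap
xeon_cpu_util = 0.15          # reported CPU utilization at that cap
xeon_tdp_w = 102              # placeholder whole-system TDP used in the post

arm_rps = 5_500               # placeholder ARM result
arm_power_w = 5               # the "5-watt" figure from the headline

xeon_rps_10gbe = xeon_rps_at_1gbe / xeon_cpu_util   # ~6.66x more requests
xeon_ppw = xeon_rps_10gbe / xeon_tdp_w
arm_ppw = arm_rps / arm_power_w
print(round(arm_ppw / xeon_ppw, 1))   # perf/watt advantage shrinks to ~2x
```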

That they're not just taking at-the-wall power measurements of both systems
probably means doing so would tip the balance closer to the Xeon - and power
at the wall is what you pay for, so that's what matters.

In addition, their choice of 4GB for the platform is worrying; it's almost
certainly a 32-bit machine, which rules out an almost perfect use case for the
machine: a memcached server.

ARM in servers is coming, to be sure. 64-bit ARM will bring unprecedented
gigabytes of RAM per watt. But toe to toe on CPU-heavy tasks, they will still
lose out badly on performance/watt at full load.

At least from the numbers in the post, I'd stick with virtualisation.

~~~
mykhal
re RAM: the platform (if it is real) is a little more interesting - they
have cards with 16GB of RAM per four 4-core CPUs (SoCs). These cards are
meant to be used together by the tens, connected with a high-speed switched
fabric on the host board, so it's something like a server-cluster-on-a-board
design:
[http://www.calxeda.com/technology/products/energycards/quadn...](http://www.calxeda.com/technology/products/energycards/quadnode/)

~~~
reitzensteinm
Now _that's_ interesting. Why on earth aren't they leading with it?

I can understand these kinds of faux benchmarks and whitepapers to convince
pointy haired bosses to switch between mature, mainstream technologies.

But the early adopters of fringe technology like this are going to be
companies with specific needs, and those making the decisions are going to be
highly technical. Not to generalize, but they (me included) love details and
possibilities, and abhor marketing puffery.

Calxeda should be sketching out novel ways to use the interconnect bandwidth
to solve hard problems, with power efficiency x86 can't touch. Not running ab
against an ARM core and misinterpreting the results. Sheesh.

~~~
mykhal
well, there is a promising HP project, Moonshot, which was going to build on
this platform, but it seems that they recently decided to switch to an Intel
solution: [http://www.engadget.com/2012/06/20/project-moonshot-take-
two...](http://www.engadget.com/2012/06/20/project-moonshot-take-two-hps-low-
power-gemini-servers-let-go/)

------
tomstokes
This is a very poor (and misleading) comparison.

First, the author admits that the Gigabit ethernet link was the bottleneck for
the Xeon system, capping it at a mere 15% CPU utilization. However, he goes on
to use published TDP numbers as the system power draw. TDP numbers are only
approached under the most demanding of loads and 100% CPU usage, which this
clearly was not. At a bare minimum, the author needs some sort of power
measurement device to make a reasonable comparison.

Second, serving static web pages is not a difficult task. The Xeon system is
overkill for such a task, so of course it will be less energy efficient. In
the real world, the extra capacity on this server could be used to perform
more difficult tasks or run other processes.

Third, if we assume the Xeon server would scale linearly without being
bottlenecked at 15% CPU usage by the Gigabit ethernet link, then it would be
serving approximately 46,300 pages per second, or almost 8.5X that of the ARM
server. Take into account that the actual power consumed in the real world
will be less than the TDP, and the efficiency gap between it and the ARM
server becomes very narrow. Even if the Xeon TDP numbers are accurate, the
new margin is still less than 2X.
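The scaling arithmetic, spelled out; the measured Xeon rate is back-calculated from the ~46,300 figure, so treat both inputs as approximations:

```python
# Linear-scaling arithmetic from the paragraph above. The measured Xeon
# rate is reconstructed from the ~46,300 pages/sec figure, so both
# inputs are approximate, not re-measured values.
xeon_measured_rps = 6_950   # ~46,300 * 0.15; the benchmark's Xeon result
arm_rps = 5_500             # approximate ARM result implied by the 8.5X ratio
utilization = 0.15          # Xeon CPU usage at the Gigabit ethernet bottleneck

xeon_unbottlenecked = xeon_measured_rps / utilization
print(round(xeon_unbottlenecked))                # ~46,300 pages/sec
print(round(xeon_unbottlenecked / arm_rps, 1))   # ~8.4x the ARM server
```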

Finally, the Total Cost of Ownership calculations in the conclusion are based
(as the author admits) on the flawed benchmark numbers. If they can only
achieve a 77% TCO reduction by completely handicapping the Xeon system, then
the ARM system may not be that advantageous. Especially when you consider that
without the bottleneck, or with a more demanding workload, you might need as
many as 8 times as many ARM servers to replace the performance of the Xeon
server, which isn't going to help real-world TCO.

I'm a big fan of the ARM platform, and I'm very excited to see ARM servers
enter the marketplace. However, false benchmarks like these aren't going to
help anyone. ARM servers will certainly have their place for a lot of
different workloads, but to suggest that they are 15X more energy efficient
and have a 77% lower TCO based on these numbers is disingenuous.

------
jws
We've all savaged Calxeda's blog post now, but it's worth noting the positive
sides too.

• It looks like the real world power savings will be something like 66%. I
know hosting facilities that base your bill on your watts. That looks like a
powerful incentive.[1]

• Virtualization is nice (#1), but if you aren't a big enough fish to own all
the slices you are at the mercy of your box mates and the financial pressures
of your hosting company. If you have your own ARM server you get to live in a
predictable world.

• Virtualization is nice (#2), but isn't there an embargoed Xen interdomain
security flaw right now? How long have bad people known about it? Is your
hosting provider in on the loop to get the fixes before they become public?

• For small sites, it doesn't matter what the efficiency of a Xeon at full
load is. You won't get there with a dedicated machine, and you don't want to
be on a virtual server that goes there.

• It looks like the boards actually have 10gbit interfaces. The 1gbit limit
was either architectural to get to the client machines or deliberate to keep
the Xeon in the same ballpark. Either way, it is reasonable for sites that
aren't going to have more than a 1gbit drop anyway.

• 48 of these quad core ARM systems fit in a 2U box.

I'd much rather have a dedicated ARM than the tiny slices of Xeons that I use
now. I don't need a random performance problem brought on by anyone other than
myself.

EOM

[1] It may be that if your workload is network bound you won't be offered
wattage based pricing. That might eliminate this savings for the people that
could best use it.

------
jws
I think I might have invested $19 in a Kill-a-watt meter before I published
that benchmark.

Using TDP for a Xeon that is operating at a small fraction of capacity is
going to mislead. Leaving out the disk drive (let's say 5 watts) helps make a
great multiplier, but it is divorced from reality.

Their Performance/Watt number in the table is, I think, actually
transactions/energy. The watt multiplier would be about 19.

------
bhauer
Another thing to consider beyond simply the fact that your particular
benchmark ran into a network bandwidth bottleneck: web server benchmarks
should not be conducted using ApacheBench until Apache makes AB multi-
threaded.

Use a multithreaded benchmark tool such as weighttp (
<http://redmine.lighttpd.net/projects/weighttp/wiki> ). weighttp is
essentially identical in behavior to ApacheBench, but with a -t argument to
specify the number of threads.

You can approximate the same behavior in ApacheBench by kicking off multiple
AB instances in parallel, but then it is up to you to aggregate the results.
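For what it's worth, the aggregation step is just summing per-instance rates. A minimal sketch, assuming you've captured each ab instance's output (the sample lines below are fabricated stand-ins):

```python
import re

# Sum the "Requests per second" figure across parallel ab runs.
# The two sample outputs are fabricated stand-ins for real ab output.
outputs = [
    "Requests per second:    2100.50 [#/sec] (mean)",
    "Requests per second:    1987.25 [#/sec] (mean)",
]
rates = [float(re.search(r"Requests per second:\s+([\d.]+)", out).group(1))
         for out in outputs]
print(f"aggregate: {sum(rates):.2f} req/s")  # aggregate: 4087.75 req/s
```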

------
ck2
We saw similar claims for Atom-based servers.

If this was accurate, Google would have adopted it immediately, the savings on
their power bill would be astronomical.

The only time Atom and ARM are 1500% more efficient is at idle.

~~~
sciurus
<http://research.google.com/pubs/archive/36448.pdf>

"So why doesn’t everyone want wimpy-core systems? Because in many corners of
the real world, they’re prohibited by law—Amdahl’s law. Even though many
Internet services benefit from seemingly unbounded request- and data-level
parallelism, such systems aren’t above the law. As the number of parallel
threads increases, reducing serialization and communication overheads can
become increasingly difficult. In a limit case, the amount of inherently
serial work performed on behalf of a user request by slow single-threaded
cores will dominate overall execution time."
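The law in question, for anyone who wants to plug in numbers (a generic sketch, not taken from the paper):

```python
# Amdahl's law: with serial fraction s, n cores give at most
# 1 / (s + (1 - s) / n) times the speed of a single core.
def amdahl_speedup(s: float, n: int) -> float:
    return 1.0 / (s + (1.0 - s) / n)

# Even 5% inherently serial work caps a sea of wimpy cores hard
# (the limit as n grows is 1/s = 20x):
for n in (4, 16, 64):
    print(n, round(amdahl_speedup(0.05, n), 2))
```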

------
mykhal
in the next benchmark, use SSL/TLS connections, and we'll see..

------
kevingadd
An actual Xeon is not going to draw its TDP in power under a simple workload
like serving static files. Even if you get it up to 100% CPU utilization, it
will probably not be drawing its TDP. Modern Intel processors have a bunch of
mechanisms built in for managing power draw (and similarly, heat output) -
they can clock up and down in response to workloads, bringing a single core up
above standard clock to speed up single-threaded loads and bringing all the
cores down when the machine is doing less intensive things (like running a
message pump or waiting for socket connections).

If anything, I'd expect static file serving on a Xeon to produce no more than
say 40% of TDP. If you're lucky, serving up all the static files will load all
the cores fairly evenly and get the CPU close to '100%', but none of the
floating point or integer logic units will be remotely loaded - it'll be
almost exclusively branch/copy work, which isn't going to put much load on the
CPU itself or draw much power or generate much heat. It's also going to be
spending tons of time waiting (on the NIC, etc) instead of actually doing
computation, which can generate a lot less heat if the waits are done using
the modern busy wait instructions instead of a spin loop.

EDIT: A comment in the OP provides a conservative estimate of 43W for the
actual draw of the entire Xeon-based system (not just the CPU) in the
benchmark. He also points out that it has more RAM (which will increase power
draw).

~~~
rbanffy
Serving static files is the kind of thing that could be done with very little
CPU involvement. Once the file is cached (or memory-mapped), just point the
NIC processor at it, tell it to pipe the memory block through the network
connection, and head off to nobler jobs.

BTW, are there NICs this clever around?
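Not quite on the NIC itself, but sendfile(2) gets most of the way there: the kernel streams a cached file straight to a socket with no userspace buffer in between. A minimal sketch (illustrative only; a socketpair stands in for a real network connection):

```python
import os
import socket
import tempfile

# sendfile(2) demo: the kernel copies the file's bytes to the socket
# directly, without passing them through a userspace buffer.
sender, receiver = socket.socketpair()
with tempfile.TemporaryFile() as f:
    f.write(b"hello, zero-copy world\n")
    f.flush()
    sent = os.sendfile(sender.fileno(), f.fileno(), 0, 1024)

data = receiver.recv(1024)
print(sent, data)
sender.close()
receiver.close()
```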

------
halayli
Again with these silly benchmarks. Let's just throw numbers out there and see
what people can make from them.

Obviously whoever ran this benchmark doesn't know what the real bottleneck is
here. Hint: It's not the CPU.

6k req/sec is nothing to be proud of. A gevent Python web server can handle
10k requests/sec, and a vanilla nginx can serve 24k reqs/sec on a commodity
machine.

------
aidenn0
Really, "ab" as a web benchmark? I had no idea anybody used this anymore. How
about a benchmark tool that at least supports HTTP/1.1?

------
secure
Aside from the questionable benchmark, can you actually buy one of these right
now?

~~~
mykhal
i'd ask: did anyone see this machine in real life? it's funny they're still
promoting it with unpopulated PCBs :) [http://www.calxeda.com/wp-
content/uploads/2012/05/Capture387...](http://www.calxeda.com/wp-
content/uploads/2012/05/Capture3870_large.jpg)

~~~
zokier
it's more energy efficient without all those pesky components :)

------
drudru11
Ok, where can I get one?

------
ksec
Maybe Intel paid them to publish these numbers? Otherwise I have no idea why
they were stupid enough to post it on the net.

