

Comparing Cloud Compute Services - jread
http://blog.cloudharmony.com/2014/07/comparing-cloud-compute-services.html

======
hhw
The problem with benchmarks is that it's really, really difficult to emulate
real-world conditions. That said, here are some of the more obvious points
that are unrealistic.

1) Comparing between the same number of cores. The core count selected for each
testing level is completely arbitrary. With both web and database servers,
which scale well to increasing core count, single-threaded performance is
generally less of a concern and should not be a point of measure aside from
average page load time. Some server configurations are optimized for higher
numbers of slower cores, while others are optimized for fewer but faster
cores. By comparing like core counts, this testing is highly skewed to the
latter.

Comparing packages at the same price point, or by how the package fits into
the product offering lineup (smallest, median, largest instance), would be
much fairer. If 4 cores at one provider cost the same as 1 core at another,
it should be fair to compare the two at different core counts.

2) Server configurations. For both web and database servers, the best
performance optimization that can be done is to cache to RAM as much as
possible. With increased caching, the need for disk I/O also goes down
significantly, easily by as much as an order of magnitude. Serving
static content uses minimal resources and is mostly dependent on network
performance. Dynamic content is more CPU intensive, and most of the time you
can and should be caching the compiled opcode/bytecode. Most website database
usage is read heavy, and many of the queries can be cached as well. The one
drawback to a heavy emphasis on caching is that if the server restarts, there
may not be enough resources to service all requests while warming up the
cache. However, given that dynamic loads are precisely what cloud offerings are
supposed to excel at, you can spin up additional instances at these times, or
just take a horizontally scaled approach to begin with so that a single
instance failing will not have a major impact on your aggregate load.
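As a concrete sketch of the caching described above for a typical PHP/MySQL
stack (the values are illustrative assumptions, not tuned recommendations):

```ini
; php.ini - cache compiled opcode in RAM (OPcache)
opcache.enable=1
opcache.memory_consumption=256

; my.cnf - size the InnoDB buffer pool so the hot data set fits in RAM
[mysqld]
; e.g. roughly 75% of RAM on a dedicated 16 GB database host
innodb_buffer_pool_size=12G
```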

3) Synthetic benchmarks, by their very nature, do a poor job of emulating real
world performance. The best way to benchmark both web server and database is
to take a real site, log all the requests, and replay the logs. What you want
to measure for is the maximum number of requests or queries that can be
served, average time and standard deviation at different requests/query rates,
etc.
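A minimal sketch of the measurement side of such a log replay, assuming
latencies have already been collected at each offered request rate (the
numbers below are placeholders, not real measurements):

```python
import statistics

def summarize(latencies_by_rate):
    """For each offered request rate, report mean latency and standard
    deviation over the replayed requests at that rate."""
    report = {}
    for rate, latencies in latencies_by_rate.items():
        report[rate] = {
            "mean_ms": statistics.mean(latencies),
            "stdev_ms": statistics.stdev(latencies),
        }
    return report

# Latencies (ms) observed while replaying logged requests at 100 and 200 req/s.
samples = {
    100: [12.0, 14.0, 13.0, 15.0],
    200: [25.0, 40.0, 30.0, 45.0],
}
print(summarize(samples))
```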

4) Network speed tests. The biggest mistake that most tests make is that they
measure performance from content network to content network, rather than from
content network to eyeball network. Especially with the current peering issues
going on between carriers and eyeballs, this is more important than ever. This
is a very difficult problem to solve however, as it's not easy to do
throughput tests from a large number of different eyeball networks. You would
have to take a very large number of client generated results, and compare
differences for all the different providers in all their different locations,
which would be nearly impossible. The next best thing, while still a lot of
work but more feasible, is to collect IPs from eyeball networks in as many
different locations as possible (or perhaps just the top X cities by
population), and run continuous pings/traceroutes over an extended period of
time. You can then use average latency, standard deviation, and packet loss %
as the metrics.
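The aggregation step of that approach might look like the following sketch,
where the RTT samples would come from real pings to eyeball-network IPs and
None marks a lost packet (the sample values are hypothetical):

```python
import statistics

def link_quality(rtts):
    """Summarize a series of ping RTTs (ms); None entries are lost packets."""
    received = [r for r in rtts if r is not None]
    lost = len(rtts) - len(received)
    return {
        "avg_ms": statistics.mean(received),
        "stdev_ms": statistics.stdev(received),
        "loss_pct": 100.0 * lost / len(rtts),
    }

# Hypothetical RTTs from one provider location to one eyeball network.
samples = [20.0, 22.0, None, 21.0, 25.0]
print(link_quality(samples))
```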

~~~
jread
Appreciate the constructive feedback. Just a few points of response:

1a) Core selections were not arbitrary - they covered most compute instance
sizes offered by each service: 2, 4, 8 and 16 cores

1b) Testing was not skewed by faster cores. CPU performance analysis was based
on SPEC CPU 2006 using base rate runtime. SPEC CPU is a multi-core benchmark
with metrics that scale linearly on number of CPU cores.

1c) The value analysis (based on CPU performance and price) highlights
differences in pricing between services. I believe matching CPU cores and
deriving value is preferable to comparing compute instances based on price.

2) The intent of the post was to provide and compare multiple relevant
performance characteristics for these types of workloads: CPU, memory, (non-
cached) storage IO, network. Each of these characteristics is analyzed
separately using relevant benchmarks and runtime settings. If your workload
relies more heavily on cached IO, then more emphasis could be placed on the
memory performance analysis.

3) There are hundreds of cloud services. High level analysis like that
provided in this post can at least get users pointed in the right direction.

4) Web server network performance analysis is based on "eyeball" or last mile
testing from our cloud speedtest
[http://cloudharmony.com/speedtest](http://cloudharmony.com/speedtest). As
stated in the post, this is the most complex performance characteristic to
measure. Internal network performance is also included in the DB server
analysis.

~~~
hhw
> 1b) Testing was not skewed by faster cores. CPU performance analysis was
> based on SPEC CPU 2006 using base rate runtime. SPEC CPU is a multi-core
> benchmark with metrics that scale linearly on number of CPU cores.

I wasn't suggesting that the testing was skewed by faster cores, but that your
criteria favour hardware configurations optimized for fewer, faster cores
because of your comparisons between packages of the same core count.

> 1c) The value analysis (based on CPU performance and price) highlights
> differences in pricing between services. I believe matching CPU cores and
> deriving value is preferable to comparing compute instances based on price.

We'll have to agree to disagree on this point. It's been my experience that in
real-world situations, when comparing options, nobody is going to reject
higher core counts at the same price. Generally, a decision maker looks for the
lowest price that meets all requirements, or the best option that still fits
within budget. They're not going to simply say that since one provider offers
a package with this particular core count, that they'll only compare against
other providers' options with the exact same core count regardless of price.
That's what I mean by the core selections being arbitrary.

For example, an E5 2643v2 6x3.5GHz CPU costs about the same as an E5 2670v2
10x2.5GHz CPU. The 2670v2 offers approximately 25% more aggregate performance,
and is clearly the better option except for the unusual cases where single-
threaded performance is more important than aggregate performance. However,
given how you choose to compare different packages with the same core counts,
infrastructure using the 2643v2 would be favoured in your testing. The
decision to compare a single 2643v2 core vs a single 2670v2 core would be
completely arbitrary.
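As a rough first-order sanity check on that comparison (base clock aggregate
only, ignoring turbo, IPC, and memory effects, which is presumably where the
~25% figure comes from):

```python
# Naive aggregate throughput = cores x base clock (GHz).
e5_2643v2 = 6 * 3.5    # 21.0 "core-GHz"
e5_2670v2 = 10 * 2.5   # 25.0 "core-GHz"

print(f"aggregate advantage of the 2670v2: "
      f"{100 * (e5_2670v2 / e5_2643v2 - 1):.0f}%")   # ~19% on base clocks alone
print(f"single-core advantage of the 2643v2: "
      f"{100 * (3.5 / 2.5 - 1):.0f}%")               # ~40% faster per core
```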

> 2) The intent of the post was to provide and compare multiple relevant
> performance characteristics for these types of workloads: CPU, memory, (non-
> cached) storage IO, network. Each of these characteristics is analyzed
> separately using relevant benchmarks and runtime settings. If your workload
> relies more heavily on cached IO, then more emphasis could be placed on the
> memory performance analysis.

My point is, the vast majority of web and database servers either rely heavily
on cached IO, or should. SSDs should not be necessary for web servers; you
will be able to serve a higher number of requests with the same budget using a
larger amount of RAM instead. Likewise, you should fit your entire database
into RAM if you can, and only resort to SSDs when you can't. And this isn't
limited to I/O: you can also reduce the amount of CPU resources required by
caching to RAM. As such, doing a comparison of packages based on the amount of
RAM included is going to be more meaningful than doing a comparison based on
the number of cores.

------
jdub
This is terrific work. I agree with other posters that the graphs and
supporting information could be improved, but underneath the presentation of
the results, you've done a VERY good job avoiding the pitfalls most
comparisons suffer. As mentioned in your intro and conclusion, this was
clearly one of your goals. Nailed it. :-)

------
traek
I found the results very interesting, but I'm not a big fan of the charts. The
x-axis is categorical, not continuous, so a line graph isn't appropriate here.

~~~
dxbydt
> so a line graph isn't appropriate here.

Not just inappropriate, but plainly incorrect as well. The charts as they
stand imply that you can pick an offering between "small server" & "medium
server" and get performance along the interpolated line?! Use scatterplots.

~~~
jread
Point taken - graphs will be improved next time.

------
davidjgraph
My colour blindness is relatively mild, but I struggled with many of the
charts. I'd suggest putting the labels on the actual lines, either in the
chart area or to one side, next to where each line terminates.

It sounds trivial, but with 20+ charts that I had to stick my face up to the
screen for, I gave up after about 5.

Regarding "Due to SPEC rules governing test repeatability (which cannot be
guaranteed in virtualized, multi-tenant cloud environments), the results below
should be taken as estimates.". I'd like to have seen some attempt to mitigate
this with, say, some kind of averaging of x tests over different instances.
Although, I understand this would have extended an already long test process.

~~~
jread
The SPEC CPU 2006 estimate comment is a legal requirement for any public
results that cannot be guaranteed to be repeatable (which will never be the
case in a virtualized, multi-tenant cloud environment). The results provided
are based on the mean from multiple iterations on multiple instances from each
service. If you were to run it on these same instances your results should be
similar. The post also provides standard deviation analysis for all SPEC CPU
runs on a given instance type to demonstrate the type of results variability
you might expect.

------
mikektr
DigitalOcean got burned pretty badly in the network performance and latency
benchmark, not to mention disk I/O. DO also came in last for availability.
Amazon seems to come out on top in most categories (including
price/performance with T2), except a few cases where Rackspace did really well
for database I/O and SoftLayer did really well in large database random
read/write throughput.

------
walterbell
High-quality content. You should repeat the download link at the bottom of the
page, after the reader has been impressed by the data and may want more, but
will definitely have forgotten this:

"This post is essentially a summary of a report we've published in tandem. The
report covers the same topics, but in more detail. You can download the report
for free."

~~~
jread
Good feedback, thanks - just added a link to the end.

------
solarwind4
Very detailed results. It's interesting how poorly DigitalOcean did for
consistency of IO. Amazon, Rackspace and SoftLayer seemed to fare best in many
categories, ahead of Azure, GCE, DigitalOcean etc. GCE database IO performance
seemed particularly poor. Amazon seems to be far ahead of the rest for
internal network throughput and latency.

~~~
jread
GCE no longer offers a local storage option, which is one reason for the
slower IO. They also cap IOPS for better IO consistency.

------
asb
It's a shame Linode weren't included in this, now that they too have per-hour
pricing.

~~~
jread
Linode will be in the next report. We chose the services and began working on
this report before Linode announced their infrastructure upgrades.

------
pinhead
I was disappointed they didn't include internal network latency variability
like they did with disk performance. I've seen EC2 have wildly different
network latencies at times, but haven't tried any of the other services.

~~~
jread
Mean latency RSD was EC2: 8.9%; Azure: 10.3%; DigitalOcean: 23.7%; GCE: 8.9%;
Rackspace: 6.1%; SoftLayer: 16.8%.
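For reference, the relative standard deviation here is presumably stdev
divided by mean; from raw latency samples it could be computed as in this
sketch (the sample values are hypothetical):

```python
import statistics

def rsd_pct(samples):
    """Relative standard deviation: stdev as a percentage of the mean."""
    return 100.0 * statistics.stdev(samples) / statistics.mean(samples)

# Hypothetical internal-network latency samples (ms) from repeated pings.
latencies = [0.50, 0.55, 0.45, 0.52, 0.48]
print(f"{rsd_pct(latencies):.1f}%")
```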

~~~
pinhead
Good to know, thanks for the reply. Does this seem to change with instance
type?

~~~
jread
Not really for latency - throughput is more variable between instance types,
particularly for services like Rackspace that have instance-specific limits.

------
cordite
I am glad that the colors were consistent between the graphs so I spent less
time referring to a legend.

If reports like this become regular (say, a monthly occurrence), would it be
possible or feasible for the cloud providers to try to game (or optimize for)
certain qualities?

~~~
jread
It is possible - but it would be difficult to optimize for every performance
characteristic because they are derived from many different benchmarks. This
was a problem in the TPC-C years, with database vendors optimizing
specifically for a single benchmark.

~~~
cordite
Does that have anything to do with the DeWitt clause? (which is apparently
also in the datomic license (last I looked))

------
brendangregg
What observability tools were used to confirm that the target of the test was
actually being tested properly?

I've performed, and also debugged, a lot of these cloud comparison benchmarks,
and it's very, very easy to have bogus results due to some misconfiguration.

~~~
brendangregg
As an example of some detail that is lacking:

What is the total file size for the fio runs? (And, what is the intended
working set size?) Was fio configured to bypass the file system cache and
perform I/O directly to disk? (And if so, what is the rationale for bypassing
it?[1]) Was iostat or other tools run during the benchmark to confirm that fio
was configured and operating correctly, and that the results could be trusted?
Was the same version of fio used, and built with the same compiler (same
binary?).

The 118 page report does not include the actual fio commands used.

[1] Disk I/O latency can be a serious issue in cloud environments, and one
that vendors can address by incorporating additional levels of storage I/O
cache (eg, in the hypervisor). Picking benchmarks that bypass the cache
discourages vendors from doing this, which is not ultimately good for our
industry.

~~~
jread
The intent of the fio testing was to measure block storage, not cached I/O.
Direct I/O was set in the fio configs, but is usually not honored by the
hypervisor. During run-up a 100% fill test was performed using refill_buffers
and scramble_buffers to break out of cache. Then optimal iodepth settings were
determined for each workload and block size by running short tests with
incrementing iodepth settings (targeting maximum IOPS). Once iodepth was
determined, 3 iterations of tests were performed, each with 36 workloads (18
block sizes, random + sequential). Each of these was 15 minutes (5 minute
ramp_time, 10 minute runtime). Since asynchronous IO and variable iodepth
settings were used, latency wasn't compared. Total test time per instance for
run-up and 3 iterations was about 36 hours. fio configs are available here
(iodepth and device designation are added at runtime):
[https://github.com/cloudharmony/fio/tree/master/workloads](https://github.com/cloudharmony/fio/tree/master/workloads)
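For readers unfamiliar with fio job files, a sketch of what one such workload
might look like (illustrative values only, not the actual configs from the
repository above):

```ini
# One of the 36 workloads: 4k random read, direct (non-cached) I/O,
# 5 minute ramp_time + 10 minute runtime, as described above.
[randread-4k]
ioengine=libaio
rw=randread
bs=4k
# direct=1 asks that the cache be bypassed (not always honored by hypervisors)
direct=1
# in the real runs, iodepth was chosen per workload by a short sweep
iodepth=32
time_based=1
ramp_time=300
runtime=600
# hypothetical target block device
filename=/dev/xvdb
```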

~~~
brendangregg
Thanks, but I disagree with the approach of only showing storage benchmarks
with disabled caches. Production workloads will encounter variance between the
providers thanks to different caches and behaviors of handling direct I/O. I'd
include direct I/O results _with_ cached results, so that I wasn't misleading
my customers.

I know what I'm suggesting is not the current norm for cloud evaluations. And
I believe the current norm is wrong.

The more important question is how the benchmarks were analyzed -- what other
tools were run to confirm that they measured what they were supposed to?

~~~
jread
Good point - user experience may include cached and non-cached I/O so it would
be beneficial to include both in this type of analysis.

The benchmark binaries, configurations and runtime settings were generally
consistent for instance types of the same size across services, but we didn't
verify efficacy of the benchmarks as they ran.

------
CMCDragonkai
There's also a thing called ServerBear; it's like server benchmark ratings but
run by users.

------
gidgreen
See also the continuously updated benchmarks at cloudlook.com

