
Feeding data to 1000 CPUs – comparison of S3, Google, Azure storage - ranrub
http://blog.zachbjornson.com/2015/12/29/cloud-storage-performance.html
======
oavdeev
Stock Ubuntu needs the SR-IOV driver to reach the actual bandwidth limit on
EC2; it makes a lot of difference. We routinely get ~2 Gbps down from S3 with
that setup (using the largest instance types).

edit: Gbps not GBps

~~~
rmcpherson
That's true, although the latest stock Ubuntu HVM AMIs (14+, I believe) have
the SR-IOV driver already and use it by default. Older AMIs need to have it
installed and enabled. I believe Enhanced Networking is only available on HVM
AMIs.

~~~
oavdeev
This problem definitely existed with the official 14.04 (HVM) AMI, though I
haven't re-tested recently; they may have fixed it. It did have some kind of
SR-IOV driver, but it was too old.

------
lowbloodsugar
If you are pulling large files from S3, we have found that transfers can be
sped up by requesting multiple byte ranges simultaneously. It is easy to hit
5 Gb/s or 10 Gb/s on instances with the necessary bandwidth, whether accessing
a single file or multiple files. We have not encountered a limit on S3 itself.
YMMV.
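
A minimal sketch of the technique, assuming boto3 and hypothetical bucket/key
names; part size and worker count would need tuning for a real workload:

    import concurrent.futures

    import boto3

    s3 = boto3.client("s3")
    BUCKET, KEY = "my-bucket", "big-file.bin"  # hypothetical names
    PART = 32 * 1024 * 1024  # 32 MiB per ranged GET

    def fetch(start, end):
        # Each worker issues an independent ranged GET; S3 serves
        # many ranges of the same object in parallel.
        resp = s3.get_object(Bucket=BUCKET, Key=KEY,
                             Range=f"bytes={start}-{end}")
        return start, resp["Body"].read()

    size = s3.head_object(Bucket=BUCKET, Key=KEY)["ContentLength"]
    ranges = [(i, min(i + PART, size) - 1) for i in range(0, size, PART)]

    buf = bytearray(size)
    with concurrent.futures.ThreadPoolExecutor(max_workers=16) as pool:
        for start, data in pool.map(lambda r: fetch(*r), ranges):
            buf[start:start + len(data)] = data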

~~~
hrez
Excellent
[https://github.com/rlmcpherson/s3gof3r](https://github.com/rlmcpherson/s3gof3r)
is my tool of choice for "fast, parallelized, pipelined streaming access to
Amazon S3."

If you want to saturate network bandwidth with S3, that's the one tool I know
of that can do it.

------
jedberg
AWS has a limit on the total throughput any one _account_ can have to S3, so
the more CPUs OP adds, the worse OP's performance will be on each one. I
suspect the other providers have the same restriction.

Either I missed it or OP didn't specify how many instances they were using at
once to run their benchmark, but the more instances they used, the worse the
per-node throughput would be.

This did not seem to be accounted for.

EDIT: OP says below it was from one instance, so what I said doesn't apply to
this writeup.

~~~
zbjornson
All the benchmarks were from a single instance.

(Note that I have done some testing from AWS Lambda, where we had 1k lambda
jobs all pulling down files from S3 at once. That's a bit harder to
benchmark...)

~~~
jedberg
Hi OP, nice writeup! I hope my comment wasn't construed as dismissing the
work; it was just a criticism of one small part.

It sounds like that wouldn't have been a factor, except for the cap you seem
to have discovered on Amazon that you called out.

My only suggestion then is you may want to make it explicit that you ran the
benchmarks from a single instance.

~~~
zbjornson
Thanks! Not at all, it's a great point and something I didn't realize would
play into the equation.

------
ChuckMcM
When I see things like "data set size 150GB" and "1000 CPUs" I just naturally
assume they are all in memory and never come from disk :-)

~~~
zbjornson
That's one of many data sets on the server, so unfortunately we can't keep
them all in memory at once. :(

~~~
ChuckMcM
Let's assume that when you say "CPU" you mean "core", and that your typical
server-class machine has 24 of those. 1000 "CPUs" is about 42 machines; if
they each donate 32GB to the cause[1], that is about 1.3TB worth of data, only
a few microseconds away from any core.

I'm not sure why anyone would build a server with less than 96GB on it these
days, so it's not at all unreasonable. Your service provider may jerk you
around, but you can run two racks of machines (48 machines) in a data center
with specs like that for about $25K/month (including dual gigabit network
pipes to your favorite IP transit provider), so it isn't even all that huge of
an investment.

[1] Consider your typical 'memcached' type service where data is named as a
function of IP and offset.
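
A minimal sketch of that kind of addressing, with a hypothetical node list and
chunk size; a real deployment would likely use consistent hashing so adding or
removing nodes doesn't reshuffle every key:

    import hashlib

    NODES = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]  # hypothetical cache nodes
    CHUNK = 64 * 1024 * 1024  # 64 MiB chunks

    def cache_key(dataset, offset):
        # Name each chunk by dataset and chunk-aligned offset so any
        # client can compute where a given byte range lives.
        return f"{dataset}:{(offset // CHUNK) * CHUNK}"

    def node_for(key):
        # Stable hash -> node mapping.
        h = int(hashlib.md5(key.encode()).hexdigest(), 16)
        return NODES[h % len(NODES)]

    print(node_for(cache_key("dataset-150gb", 3 * CHUNK + 17)))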

------
ranrub
With kernel tuning, S3 performance improves (and would probably improve on
GC/Azure as well). Also, the author uses Ubuntu 14.04 (see
[https://twitter.com/Zbjorn/status/684492084422688768](https://twitter.com/Zbjorn/status/684492084422688768)),
which doesn't use AWS "Enhanced Networking" by default. It would be
interesting to see results for tuned systems.
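
For reference, "kernel tuning" in this context usually means the TCP buffer
knobs in /etc/sysctl.conf, along these lines (illustrative values, not a
tested recommendation):

    # allow larger TCP send/receive buffers for high-bandwidth paths
    net.core.rmem_max = 16777216
    net.core.wmem_max = 16777216
    net.ipv4.tcp_rmem = 4096 87380 16777216
    net.ipv4.tcp_wmem = 4096 65536 16777216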

------
skywhopper
Very interesting comparison, glad to see it. I don't have a comment on the
content itself but I do have a note on the presentation.

The colors used for S3 and Azure Storage in the graphs are nearly
indistinguishable to me, as I have moderate red-green colorblindness. They're
easier to tell apart on the bar graphs, since the patches of color are much
larger, although I still have to work at it and lean on the labels; on the
line graphs, it's basically impossible. A darker shade of green would solve
the problem for me personally, but I'm not all that bad a case, nor an expert
on the best shades to pick for general colorblindness accessibility.

Just something to think about when presenting data like this.

~~~
mistermann
Colorblind here as well; I had to zoom in incredibly close to tell the two
apart.

~~~
zbjornson
Thanks for pointing this out, and my apologies! Will fix that going forward.

------
jen20
Has the author (if they are reading here) considered using Joyent's Manta to
take the processing to the data instead?

~~~
vgt
There are plenty of architectures that do exactly this: EMR-on-S3, Google
Dataproc on GCS, Snowflake-on-S3, BigQuery-on-GCS, etc.

The bigger point in the article is that these exact "take the processing to
the data" architectures operate exceedingly well on S3, GCS, and Azure.

And, as a biased observer, these architectures operate best on GCS, due to the
strong performance measured in the article, quick VM startup times, low VM
prices, and per-minute billing.

~~~
zbjornson
I'm still trying to parse the docs and Manta source code to see what it
actually does, but it seems unique if the data storage nodes are also the data
processing nodes and no data transfer happens from some storage service before
the job begins. The other key factor is having neither startup time nor the
cost of a perpetually running cluster. Per my comment below [1], we have used
Lambda with S3 to get something like this, as well as our own architecture
built on plain EC2/GCE nodes.

[1]
[https://news.ycombinator.com/item?id=10846514](https://news.ycombinator.com/item?id=10846514)

~~~
qaq
Not only that, but it's built by people who really know what they're doing,
like Bryan Cantrill and other former top Sun people.

------
rmcpherson
In S3 tests on c3.8xlarge instances, I've seen 8 Gbps throughput on both
uploads and downloads using parallelized requests. Testing with iperf between
two of the same instances maxed out at about 8 Gbps as well, so the throughput
limitation is likely EC2 networking rather than S3.

These tests were done over a year ago, so bandwidth limitations on EC2 may
have changed since.

This testing was with
[https://github.com/rlmcpherson/s3gof3r](https://github.com/rlmcpherson/s3gof3r)

~~~
zbjornson
That's really cool. I wonder if the same technique (parallel streams) would
help for Azure and GCS. I know GCS has some built-in capabilities for
composite uploads/downloads, which might achieve a similar effect.
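
A minimal sketch of the composite-object approach, assuming the
google-cloud-storage Python client and hypothetical bucket/object names: parts
upload in parallel, then compose() stitches them together server-side (up to
32 sources per call):

    import concurrent.futures

    from google.cloud import storage

    bucket = storage.Client().bucket("my-bucket")  # hypothetical bucket

    def upload_part(i, data):
        # Parts are independent objects, so uploads can run in parallel.
        part = bucket.blob(f"parts/big-file.{i}")
        part.upload_from_string(data)
        return part

    chunks = [b"..."] * 4  # stand-ins for real data chunks
    with concurrent.futures.ThreadPoolExecutor() as pool:
        parts = list(pool.map(lambda a: upload_part(*a), enumerate(chunks)))

    # Server-side concatenation; no data flows back through the client.
    bucket.blob("big-file").compose(parts)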

------
imperialdrive
Thanks for sharing your research. I've been up to my neck in EC2 migrations
and trying to benchmark as I go... S3 is the next chunk of work. Rock on!

------
hrez
What's missing from the description is the network setup. Is it EC2-Classic or
a VPC? Is EC2 getting to S3 through an IG? Hopefully not through NAT. There is
also the VPC endpoint for S3. These may all have different performance
profiles, especially with multiple instances.

~~~
zbjornson
Network was VPC. The EC2 instance had an IG attached, yes, but I'm not sure if
you're asking if an internal vs. external URL for S3 was used? Are you saying
there's a better endpoint than s3-<region>.amazonaws.com for S3 requests from
EC2?

~~~
hrez
I meant [http://docs.aws.amazon.com/AmazonVPC/latest/UserGuide/vpc-en...](http://docs.aws.amazon.com/AmazonVPC/latest/UserGuide/vpc-endpoints.html)

It's a private connection to AWS services, including S3. You'd use the same
URL; it's basically a routing change. No idea whether the VPC endpoint would
be better than the IG, though. P.S. Just tested, and I get about half the
latency over the VPC endpoint.
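
Setting one up is a one-time operation, e.g. with boto3 (the VPC and route
table IDs here are hypothetical, and the service name varies by region):

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # A gateway endpoint adds S3 routes to the given route tables, so
    # instances reach S3 privately instead of via the IG or NAT.
    ec2.create_vpc_endpoint(
        VpcId="vpc-12345678",                    # hypothetical VPC
        ServiceName="com.amazonaws.us-east-1.s3",
        RouteTableIds=["rtb-12345678"],          # hypothetical route table
    )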

~~~
zbjornson
Neat, did not know about that. Will add it to my follow-up benchmarks. Thanks
for all the comments. :)

------
dwelch2344
I'd be interested to see how AWS' Elastic File System (EFS) compares (though
I'd imagine it's not great, given it's mounted via NFS)

~~~
zbjornson
I've been on the list to get into their preview program for a while so I can
benchmark it, actually! Part 3 of the blog post is going to include some NFS
stuff either way.

~~~
acdha
When you do, it would be really useful to include the classic fio/bonnie/etc.
stuff to break down performance by the type of operation (e.g. file creation /
deletion, streaming read/write, random read/write) and block size.

EFS supports NFSv4, so it should avoid being as routinely limited by server
round-trip latency as NFSv3 tends to be, but it'd be nice to see how well that
works in practice.
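
For example, a hypothetical fio invocation for the random-read case (mount
point, size, and job count are placeholders to adjust):

    fio --name=randread --directory=/mnt/efs --rw=randread \
        --bs=4k --size=1g --numjobs=4 --direct=1 --group_reporting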

------
frik
How reliable is Azure? For example, the story of GitLab on Azure was a
disaster:
[https://news.ycombinator.com/item?id=10781263](https://news.ycombinator.com/item?id=10781263)
Something like that wouldn't happen on AWS, GC, SoftLayer, etc.

------
qaq
WTF, why would one deploy such a thing in the cloud?

~~~
cottonseed
Because renting 1000 cores for a limited time is much cheaper than buying them
outright?

~~~
qaq
1000 cores of what? "vCore" is marketing BS. Even if it weren't, that's 28 2U
3-node boxes (using older CPUs) or 14 2U 3-node boxes (using more recent
ones); unless they have an extremely spiky workload, using AWS is pointless.
Bandwidth-bound scientific apps ==> use an InfiniBand cluster.

~~~
cottonseed
The OP is talking about running about $0.028 worth of computation (1000 cores
for 10 s at $0.01/core/hr is ~2.8 core-hours) and you think he should spend
tens of thousands on hardware?

I'm not doubting a custom build will give him much greater bandwidth. I just
doubt the workload has to be "extremely" spiky to make the cloud cost-
effective.

Of course, he's going to get billed for 10m or 1hr minimum (Google or Amazon),
so that's assuming he can amortize his startup across multiple jobs.

~~~
semi-extrinsic
The big question is, why does it need to run in 10s? The main reason I can see
is to be able to run this analysis very frequently, but then your workload is
approaching constant.

The total amount of data is 150 GB; that would easily fit into memory on a
single powerful 2-socket server with 20 cores and would then run in less than
15 minutes. The hardware required to do that will cost you ~ $6000 from Dell;
assuming a system lifetime of five years and assuming (like you do) that you
can amortize across multiple jobs, the cost is roughly the same as from the
cloud, about $0.036 per analysis.

I'm fairly certain that, in the end, it's not more expensive for the customer
to just buy a server to run the analysis on.

Edit: I see OP says 80% of the time is spent reading data into memory, at
about 100 MB/s. Add $500 worth of SSD to the example server I outlined, and we
can cut the application runtime by >70%, making the dedicated hardware
significantly cheaper.

~~~
qaq
A vCore is a hyperthread of an unknown CPU, so in reality 1000 vCores is 500
real cores; after all the overheads, it's more like 450. Given the low
utilization while the dataset loads, to keep the run at 10 sec you would need
~90 real cores, or 4 x 3-node dual-socket boxes (eBay, ~$1.5K each) and 2
InfiniBand switches (eBay, 2 x $300). For ~$6,600 you have a dedicated
solution with no latency bubbles at a fixed low cost.

~~~
zbjornson
Briefly... We have many data sets, and the <10 sec calculations happen every
few seconds for every data set in active use. Caching results rarely helps in
our case because the number of possible results is immense. The back end
drives an interactive/real-time experience for the user, so we need the
speed. Our loads are somewhat spiky; overnight in US time zones we're very
quiet, and during the daytime we can use more than 1k vCPUs.

We've considered a few kinds of platforms (AWS spot fleet/GCE autoscaled
preemptible VMs, AWS Lambda, bare metal hosting, even Beowulf clusters), and
while bare metal has its benefits as you've pointed out, at our current stage
it doesn't make sense for us financially.

I omitted from the blog post that we don't rely exclusively on object storage
services, because their performance is relatively low. We cache files on
compute nodes, so we avoid that "80% of time is spent reading data" a lot of
the time.

(Re: Netflix, in qaq's other comment, I don't have a hard number for this, but
I thought a typical AWS data center is only under 20-30% load at any given
time.)

