

Greplin (YC W10) AWS Benchmarks and Best Practices - smanek
http://tech.blog.greplin.com/aws-best-practices-and-benchmarks

======
epi0Bauqu
Thanks for putting this together. A few questions come to mind:

1) Did you try bigger instance types, e.g. large? I've heard that at that
level and above you get better networking performance, which theoretically
might improve EBS performance.

2) Related to #1, I wonder if you've tried doing RAID0 on the ephemeral
drives? You get two with the large instance, and four with xlarge.

3) Did you ever measure non-EBS network performance during the tests? I've
always wondered whether using EBS heavily would slow down other network
traffic, given there is only one interface.

4) How often have you yourself experienced EBS volume failure in your RAID
volumes?

5) When that happens, what happens to your volumes and instance? That is, what
do you use to monitor when the RAID volume degrades? Does it usually take down
the instance immediately or only after some time? If it just becomes really
slow does that throw alarms or does the application just become really slow?

6) Finally, what is your current procedure for dealing with a volume failure?

Well, that ended up being a lot of questions. I'd really appreciate any
answers/insights you or anyone else could shed on these questions. I've been
reading all these posts but it feels a bit like reading tea leaves.

~~~
jread
In response to question #1: I've conducted some similar benchmarking across
all instance sizes, using both ephemeral and EBS storage with and without
RAID. EBS is notably faster on larger instances. We observed roughly 2.5-3x
better IO with EBS-backed m2 instances compared to c1.medium. Ephemeral RAID 0
on the cc1.4xlarge was about 6x faster.

http://blog.cloudharmony.com/2010/06/disk-io-benchmarking-in-cloud.html
http://blog.cloudharmony.com/2011/04/unofficial-ec2-outage-postmortem-sky-is.html

------
SriniK
Great test. I wonder if concurrent db access (submit reads to multiple db
servers and take the first response) would work around these weird variations
in EBS response times.
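The "submit to several replicas, take the first response" idea can be sketched with asyncio. This is only an illustration: `query_replica` is a hypothetical stand-in for a real database read, with randomized latency to mimic EBS variance.

```python
import asyncio
import random

async def query_replica(name: str) -> str:
    # Hypothetical stand-in for a read against one db server; latency
    # varies the way EBS-backed volumes do.
    await asyncio.sleep(random.uniform(0.01, 0.2))
    return f"result from {name}"

async def hedged_read(replicas):
    # Fire the same read at every replica, keep the first answer,
    # and cancel the stragglers.
    tasks = [asyncio.create_task(query_replica(r)) for r in replicas]
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    for t in pending:
        t.cancel()
    return done.pop().result()

print(asyncio.run(hedged_read(["db1", "db2", "db3"])))
```

The trade-off, of course, is N reads' worth of load for one answer, so this only makes sense when tail latency hurts more than the extra traffic.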

Adrian C from Netflix had a nice tech article on EBS performance,
"Understanding and using Amazon EBS - Elastic Block Store":
http://perfcap.blogspot.com/2011/03/understanding-and-using-amazon-ebs.html

------
assiotis
The problem with RAID is not just drives failing outright, but also UBEs
(Unrecoverable Bit Errors). Let's say you have a RAID5 configuration. In
addition to the probability of disk failure, you need to account for the
probability of a failure plus the controller's inability to repair the
problem because a sector on one of the good disks experienced a UBE during
the rebuild. The probability of a UBE on enterprise-level disks is rather low
(one per 10^16 bits read, I think), but that quickly shoots up once you start
considering 1TB and 2TB drives.
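As a rough back-of-the-envelope check, assuming independent errors at the 10^-16-per-bit rate cited above, the chance of hitting at least one UBE while reading the surviving disks during a rebuild can be computed like this:

```python
import math

def rebuild_ube_probability(tb_read: float, ube_rate: float = 1e-16) -> float:
    # Probability of >= 1 unrecoverable bit error while reading `tb_read`
    # terabytes, assuming independent errors at `ube_rate` per bit
    # (1e-16 is the enterprise-class figure mentioned above).
    bits = tb_read * 1e12 * 8  # terabytes -> bits
    return 1 - math.exp(-bits * ube_rate)

# Rebuilding a 4-disk RAID5 of 2 TB drives means reading the 3 survivors:
print(f"{rebuild_ube_probability(3 * 2.0):.4%}")  # roughly 0.5%
```

Small per-rebuild, but it grows linearly with drive size, which is why the risk "shoots up" for multi-terabyte disks.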

The problem with EBS benchmarks is that they largely depend on who else was
sharing the spindles with you at the time the benchmark was run. Given the
large variance in performance that is being reported, the sample size for any
reliable statistics would need to be quite large.

------
krobertson
"RAID helps smooth out flaky performance"

Really? My experience with EBS has been closer to the opposite... especially
with write performance.

If you've got to write data, you've got to write data. Flaky performance
becomes more likely as more volumes are added.

Previously, I think AWS engineers recommended we use RAID0 with EBS... ugh.
Cloud or not, RAID0 on production DB servers sounds downright suicidal.

~~~
smanek
I'd guess that writing to the RAID is still bottlenecked by the worst drive in
the array. But if only one out of N blocks has to be read from or written to
that drive, average performance may improve (basically, the idea of
'regression toward the mean').

I don't know if that's what's actually happening, but testing seems to
confirm that RAID performance is more consistent than single-drive
performance.
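That intuition can be sketched with a quick simulation. The latency model below is made up; the only point is that for single-block random I/O, effective speed averages over the stripe members, so variance shrinks as volumes are added:

```python
import random
import statistics

random.seed(0)

def volume_speed() -> float:
    # Made-up model of one EBS volume's MB/s at a given moment:
    # usually decent, occasionally badly degraded.
    return random.uniform(8, 12) if random.random() < 0.9 else random.uniform(1, 4)

def raid_random_io_speed(n: int) -> float:
    # For single-block random I/O, each request hits one stripe member,
    # so effective speed is the average across the n volumes.
    return statistics.mean(volume_speed() for _ in range(n))

trials = 2000
single = [raid_random_io_speed(1) for _ in range(trials)]
raid8 = [raid_random_io_speed(8) for _ in range(trials)]
print(f"1 volume:  mean {statistics.mean(single):.1f} MB/s, stdev {statistics.stdev(single):.2f}")
print(f"8 volumes: mean {statistics.mean(raid8):.1f} MB/s, stdev {statistics.stdev(raid8):.2f}")
```

The mean speed is the same either way, but the stripe's spread is much tighter, which matches the "more consistent, not faster per-op" observation.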

RAID0 on EC2 isn't as insane as it appears at first glance. EBS volumes have
an annual failure rate of around 0.2%, while hard drives are around 5%. So
the chance of one EBS volume failing is about the same as the chance of two
physical hard drives failing (5%^2 = 0.25%). It doesn't sound unreasonable to
suggest that a RAID0 of EBS volumes is about as reliable as a RAID10 of
physical drives, and more reliable than a RAID5 of physical drives.

~~~
chrisbolt
What good is a reliable datastore that is slow as molasses?

~~~
smanek
If you take a look at our benchmarks, you'll see that with RAID you can get
100 MB/s of sequential reads (or ~5 MB/s of random reads). Even on a really
bad day, speeds will only fall to around a third of that.

While that's not the fastest thing in the world (e.g., a good SSD will
outperform a 16-drive EBS RAID for most workloads), I don't think it's fair
to characterize it as 'slow as molasses'.

~~~
chrisbolt
Most people are probably using EBS to run a relational database, or something
that is more likely to be doing random reads than sequential reads. And
speaking from experience, a 4-drive EBS RAID couldn't even match the
performance of a local 4-drive RAID-10. Once we started adding SSDs, the gap
widened significantly.

