
"Amazon's EBSs are a barrel of laughs in terms of performance and reliability" - quilby
http://www.reddit.com/r/blog/comments/g66f0/why_reddit_was_down_for_6_of_the_last_24_hours/c1l6ykx
======
snorkel
Having been at a startup that used hundreds of EC2 instances and EBS volumes,
I can assure you all that Amazon EBS performance is downright terrible, and
Amazon didn't inspire any confidence that they could solve it.

Even worse than the EBS performance is that Amazon does not offer any shared
storage solution between EC2 instances. You have to cobble together your own
shared storage using NFS and EBS volumes, making it sucky to the Nth power.
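
To give a flavor of the cobbling, here is a rough sketch of the NFS-over-EBS
setup (every name here - device, mount point, subnet - is hypothetical, and it
assumes the volume is already attached and formatted and an NFS server is
installed):

    # Hypothetical "storage" instance: export an EBS-backed filesystem
    # over NFS to the other EC2 instances. Run as root.
    import subprocess

    EBS_DEVICE = "/dev/sdf"      # hypothetical attached EBS device
    MOUNT_POINT = "/srv/shared"
    CLIENTS = "10.0.0.0/16"      # hypothetical subnet of the client instances

    subprocess.check_call(["mount", EBS_DEVICE, MOUNT_POINT])

    # Append an export line and tell the NFS server to re-read it.
    with open("/etc/exports", "a") as exports:
        exports.write("%s %s(rw,sync,no_subtree_check)\n"
                      % (MOUNT_POINT, CLIENTS))
    subprocess.check_call(["exportfs", "-ra"])

    # Each client instance then mounts it with something like:
    #   mount -t nfs <storage-instance-ip>:/srv/shared /srv/shared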

EC2 is fine for Hadoop-style distributed workloads and for distributed data
stores that can tolerate eventual consistency; that's all good. But for
production database applications requiring constant and reliable performance,
forget it.

~~~
watchandwait
My experience with the AWS RDS database product has been excellent.

~~~
krobertson
We looked at RDS and had a call with some of their engineers, but our
EC2 + RAIDed EBS setup was almost identical to theirs, with all the best
practices already in place.

Since RDS really is just EC2 + EBS, they couldn't provide any real assurance
that it performed better than our own installation.

We ended up moving off of AWS entirely. After several discussions about how
we could continue to scale, the ultimate answer was: without AWS.

EC2 is great for distributed stuff, but when you need something that is heavy
on IO, for instance, it is a big problem. At scale, it ends up costing more to
work around AWS's performance problems than to go elsewhere.

~~~
ketralnis
Yeah, they have a few products (e.g. EMR, RDS) where they charge by the
instance anyway, so you're just paying them by the hour for the five minutes
it would take you to set up the server once.

~~~
adpowers
Hmm. I think you underestimate the effort that went into those two. RDS has
really good replication, which is really hard to configure and set up
yourself. And having configured Hadoop, I know it takes more than 5 minutes :)
Perhaps Whirr makes that easier. Also, EMR's Hadoop is tuned to work really
well with S3, which you don't get with stock Hadoop (or even with
Cloudera's).

------
SemanticFog
We had consistent serious problems related to EBS for a several-month streak
about a year ago, and I heard almost identical stories from other EC2 users
around the same time. Instances with EBS attached would suddenly become
completely unreachable via the network. Sometimes we had to terminate the
instances, but usually we could revive them by detaching all (or most) of the
EBS volumes, then reattaching and rebooting. Amazon seems to have fixed this
problem, but I wouldn't be surprised if we suffered in the future the way
reddit has.
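
For the record, the revival dance was scriptable with boto - roughly this
(instance ID, volume IDs, and device names are all made up, and the polling
for volumes to actually detach is omitted):

    # Rough sketch of the detach / reattach / reboot workaround above,
    # using the boto library. No error handling; in practice you would
    # poll each volume's status between steps.
    import boto.ec2

    INSTANCE_ID = "i-0123abcd"                  # hypothetical
    VOLUMES = {"vol-11111111": "/dev/sdf",      # hypothetical volume -> device
               "vol-22222222": "/dev/sdg"}

    conn = boto.ec2.connect_to_region("us-east-1")

    # Force-detach every EBS volume from the wedged instance...
    for volume_id in VOLUMES:
        conn.detach_volume(volume_id, instance_id=INSTANCE_ID, force=True)

    # ...then, once they report "available", reattach and reboot.
    for volume_id, device in VOLUMES.items():
        conn.attach_volume(volume_id, INSTANCE_ID, device)
    conn.reboot_instances([INSTANCE_ID])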

Overall, EC2 is a very impressive offering, for which I commend Amazon. At
times, I've been so frustrated that I'm ready to switch, but they fix things
just quickly enough that I never quite get around to it. In the end, I'm
willing to accept that what they're doing is hard, there will be mistakes, and
it's worth suffering to get the flexibility and cost-effectiveness that EC2
offers.

------
jameskilton
This comment further down, supposedly from an Amazon employee, paints a grim
picture for EBS:
[http://www.reddit.com/r/blog/comments/g66f0/why_reddit_was_d...](http://www.reddit.com/r/blog/comments/g66f0/why_reddit_was_down_for_6_of_the_last_24_hours/c1l7vy1)

~~~
jedsmith
The perspectives of disgruntled employees have been known to be worse than
reality, on occasion. Not definitively saying that's the case here, just
saying.

~~~
hn_throw_away
I work at Amazon - a lot of teams are like this. They're stuck managing a
woefully broken product and spend all of their time propping up the beast,
leaving no capacity for meaningful fixes (which, in these cases, are always
gigantic engineering projects).

The team develops an internal reputation for glorified firefighting and has
trouble recruiting. More senior engineers eventually flee (having, well, a
choice in the matter), leaving a team heavy with junior talent and no seasoned
gurus leading the way.

The company is also growing at ludicrous speed, and hiring is difficult. When
the product is in such a painful state, attrition from the team is high, and
with slow hiring you are _barely_ countering attrition (exacerbating the
junior-talent problem), let alone growing the team to the point where it could
take care of the problem for good.

I suspect this is an industry-wide problem though, and is hardly unique to
this place.

~~~
bgentry
I wonder if it's time for AWS to open a development office in a more startup-
oriented city (e.g. SF). It might help them attract and retain more talent.

~~~
WALoeIII
A.) Amazon isn't a startup. B.) A lot of Bay Area companies (Zynga, Facebook,
Salesforce) are opening Seattle offices to take advantage of the Amazon and
Microsoft talent pools. C.) They already have a Bay Area office.
<http://public.a2z.com/index.html> I believe some core SimpleDB guys (Jim
Larson) were based out of there.

------
rlpb
RAIDing together multiple EBS volumes feels like a massive hack to me. I
can't help but wonder if this compounds the problem at Amazon's end. If EBS
performance is a problem, Amazon needs to fix it. For example, if tying
together multiple EBS volumes is a reasonable way of working around the
problem, then why isn't Amazon providing "high performance" EBS volumes that
do that under the hood?

If I were faced with EBS performance issues, I would see this as a big red
flag, consider EBS unsuitable for the application and avoid it, rather than
carrying on with such a workaround.

~~~
andrewvc
One other huge downside of RAIDing EBS volumes is that you can't use EBS's
snapshotting features, as you cannot guarantee a perfect sync across the
volumes (you could use LVM yourself, however).

Honestly, since EBS volumes are supposedly not tied to a single disk, the
RAIDing should be done on Amazon's end. That it isn't is telling.

~~~
saurik
You have to snapshot at the system level anyway if you want a consistent
snapshot: otherwise the filesystem (or your database) could have reordered and
delayed writes that end up not being part of the "consistent snapshot". This
is not a RAID-specific issue, nor is it a problem with EBS (it is generally
easy to use LVM, xfs, and/or PostgreSQL to handle that part of the job).

~~~
agmiklas
This is something I've never quite understood. Best practice guides say you
need to do a "flush all tables" in MySQL and then do a filesystem freeze
(possible in XFS) before you can use a snapshot system like the ones built
into EBS or LVM. If you don't, you apparently stand a good chance of getting
an inconsistent snapshot, even if the snapshotting mechanism itself is (like
EBS and LVM) "point in time" consistent.
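
Concretely, the prescribed sequence looks something like this (a sketch only -
the volume ID, mount point, and credentials are hypothetical, and it assumes
the MySQLdb and boto libraries plus the xfs_freeze tool):

    # Flush + freeze + snapshot + thaw, per the best-practice guides.
    import subprocess
    import MySQLdb
    import boto.ec2

    MOUNT_POINT = "/data"         # hypothetical XFS mount on the EBS volume
    VOLUME_ID = "vol-0123abcd"    # hypothetical

    db = MySQLdb.connect(user="root", passwd="...")  # credentials elided
    cur = db.cursor()

    # 1. Quiesce MySQL; hold the lock for the duration of the snapshot.
    cur.execute("FLUSH TABLES WITH READ LOCK")
    try:
        # 2. Freeze the filesystem so no writes reach the block device.
        subprocess.check_call(["xfs_freeze", "-f", MOUNT_POINT])
        try:
            # 3. Kick off the EBS snapshot; its point-in-time is fixed as
            #    soon as the call returns, even though the copy to S3
            #    completes in the background.
            conn = boto.ec2.connect_to_region("us-east-1")
            conn.create_snapshot(VOLUME_ID, "consistent MySQL snapshot")
        finally:
            # 4. Thaw the filesystem.
            subprocess.check_call(["xfs_freeze", "-u", MOUNT_POINT])
    finally:
        # 5. Release the MySQL read lock.
        cur.execute("UNLOCK TABLES")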

Why is all this necessary? If the system (i.e. DB + FS + block device) is
working as it should, then once a commit returns, the data should be on disk.
If it's not, you have no guarantee that data you thought was committed will
still be there after a kernel panic or power outage.

In that case, no amount of xfs-freeze or table flushing during a snapshot is
going to save you from the fact that your DB is one kernel panic away from
losing what the rest of your system believed were committed transactions.

~~~
parasubvert
This is one reason why Oracle is still the gold standard. When entering hot
backup mode, which is what you do during a snapshot, it logs the FULL BLOCKS
that are changed. Failures and inconsistencies can be replayed from the
archive logs.

Of course, this means you can quickly blow out your log archival space, so
it's meant to be a transitory mode.

~~~
saurik
PostgreSQL has this exact same feature.
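
For the curious, the rough PostgreSQL analog is full_page_writes (whole pages
go into the WAL, so a fuzzy file-level copy can be repaired on replay) plus
the backup-mode calls. A minimal sketch, assuming psycopg2, with connection
details made up:

    import psycopg2

    conn = psycopg2.connect("dbname=postgres user=postgres")
    conn.autocommit = True
    cur = conn.cursor()

    cur.execute("SELECT pg_start_backup('ebs-snapshot')")  # enter backup mode
    # ... take the EBS snapshot / file-level copy here ...
    cur.execute("SELECT pg_stop_backup()")                 # leave backup mode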

------
parasubvert
Generally speaking, this is the sort of thing people warn about when they say
"if you want to run on a cloud, you need to design your application for a
cloud". Meaning, you can't presume your infrastructure is dedicated and
carries an MTBF similar to (say) an enterprise hard drive, which is upwards of
1 million hours.

Amazon provides plenty of opportunities to mitigate this, such as multiple
availability zones. Reddit, if you read the original blog post, wasn't
designed for that - it was designed for a single data centre.

OTOH, the variability of EBS performance is real, and frustrating. If you do
a RAID0 stripe across 4 drives, you can expect around 100 MB/sec sustained,
modulo hiccups that can bring it down by a factor of 5. On a compute cluster
instance (cc1.4xlarge) it's more like 300 MB/sec if you go up to 8 drives,
since they provision more network bandwidth and seem to be able to cordon it
off better with a placement group.
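
The stripe itself is simple enough - a sketch, with hypothetical device names,
assuming mdadm and root:

    # RAID0 stripe across four attached EBS volumes.
    import subprocess

    EBS_DEVICES = ["/dev/sdf", "/dev/sdg", "/dev/sdh", "/dev/sdi"]

    subprocess.check_call(
        ["mdadm", "--create", "/dev/md0",
         "--level=0",                               # striping, no redundancy
         "--raid-devices=%d" % len(EBS_DEVICES)] + EBS_DEVICES)

    # Then format and mount as usual, e.g. mkfs.xfs /dev/md0.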

~~~
khafra
> modulo hiccups that can bring it down by a factor of 5.

The comments on reddit indicated hiccups more on the order of 10x and,
sometimes, 100x.

Either way, the issue is that the more drives you add to your RAID0, the more
often one of those drives experiences a "hiccup" and kills the performance of
the entire volume.

~~~
parasubvert
It's not clear this was a single-volume problem so much as an issue with one
or more network switches in that availability zone (if you look at the AWS
service health notes for that date).

Even in your own data centre, if your FC fabric goes wonky, your whole SAN is
hosed.

------
jedsmith
Never fails: _a_ cloud provider has issues with _a_ specific cloud product, so
clearly _the_ cloud is an illusion that will crash down on you[1]. Any
discussion about any cloud provider's product is obviously a chance to soapbox
about the industry as a whole.

[1]:
[http://www.reddit.com/r/blog/comments/g66f0/why_reddit_was_d...](http://www.reddit.com/r/blog/comments/g66f0/why_reddit_was_down_for_6_of_the_last_24_hours/c1l7531)

~~~
magicseth
In the minds of many people, Amazon is the most well-known and respected
cloud provider. Their outage is merely a reminder that the one big difference
between cloud services and in-house services is that I can't control it. Of
course, the probability that Amazon will have better uptime than you is pretty
high for most people, but you have no recourse when there is a problem.

------
tzs
We've been looking at moving some or all of our stuff to either Amazon
EC2/EBS/S3 or Rackspace cloud hosting, and it has been interesting.

Amazon seems more flexible, since you buy block storage (EBS) independent of
instances. If you have an application that needs a massive amount of data, but
only a little RAM and CPU, you can do it.

Rackspace, on the other hand, ties storage to instances. If you only need the
RAM and CPU of the smallest instance (256 MB RAM) but need more than the 10 GB
of disk space it provides, you have to go for a bigger instance, so you'll
probably end up with a higher base price than at Amazon.

On the other hand, the storage at Rackspace is actual RAID storage directly
attached to the machine your instance is on, so it is going to totally kick
Amazon's butt for performance. Also, at Amazon you pay for I/O (something like
$0.10 per million operations).

Looking at our existing main database and its usage, at Amazon we'd be paying
more just for the I/O than we now pay for colo and bandwidth for the servers
we own (not just the database servers...our whole setup!).

The big lesson we've taken away from our investigation so far is that Amazon
is different from Rackspace, and both are different from running your own
servers. Each of the three has a different set of capabilities and
constraints, so a solution designed for one will probably not work well if you
just try to map it isomorphically onto one of the others. You don't migrate to
the cloud--you re-architect and rewrite for the cloud.

~~~
delano
If you're interested to see how sites perform on EC2 and Rackspace over time:

<https://www.blamestella.com/vendor/ec2>

<https://www.blamestella.com/vendor/rackspace>

~~~
bretpiatt
It looks like you're monitoring from AWS US-East; you'll want to mention that
to give people some context around the latency numbers.

~~~
delano
That's true, but I think what's more interesting is the number of incidents
(timeouts, exceptions, and significant slowdowns).

------
mithaler
We were bitten by EBS's slowness at my company recently, when moving an
existing project to AWS. You effectively can't get decent performance out of a
single EBS volume with PostgreSQL; you need to set up 10 or so of them in a
software RAID to remove the bottleneck. It's a fairly large time commitment to
build and maintain, but it's pretty fast and reliable once it's up and running
(cases like the recent downtime notwithstanding).

Can anyone tell me if MySQL fares any better than Postgres on a single EBS
volume? I wouldn't assume it does, but I shouldn't be making assumptions.

~~~
joevandyk
Did you use RAID10? I would love to see a post on using PostgreSQL with
EC2/EBS -- how to set up the RAID, etc.

~~~
grourk
Orion Henry at Heroku wrote about this and described different software RAID
configurations and the performance characteristics of each a while back:

[http://orion.heroku.com/past/2009/7/29/io_performance_on_ebs...](http://orion.heroku.com/past/2009/7/29/io_performance_on_ebs/)

~~~
joevandyk
Yes, but as a lowly developer, I have no idea how to set read-ahead buffers
or change IO schedulers.

Plus, that's a year old; I would love to see some updated advice. You'd think
Amazon would write more guides like this.

~~~
grourk
Well, that's really just read-ahead ("blockdev --setra") and other filesystem
mount options, plus mdadm (Linux software RAID) configuration. Yes, there's a
bit of a learning curve and some pain to get things set up, but it's not
completely out of reach.

Despite being relatively old, I think the advice and approach still hold.
Clearly, EBS hasn't improved since then, and the need for this kind of
striping over EBS volumes hasn't gone away.
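
To make that concrete, the knobs joevandyk mentioned boil down to a couple of
standard Linux controls (a sketch; device names are hypothetical and this must
run as root):

    import subprocess

    MD_DEVICE = "/dev/md0"

    # Read-ahead is counted in 512-byte sectors: 65536 sectors = 32 MB.
    subprocess.check_call(["blockdev", "--setra", "65536", MD_DEVICE])

    # The I/O scheduler is set per underlying device via sysfs.
    for dev in ["sdf", "sdg", "sdh", "sdi"]:
        with open("/sys/block/%s/queue/scheduler" % dev, "w") as f:
            f.write("deadline")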

------
hemancuso
I've never understood how people can use EBS in production. The durability
numbers Amazon quotes are bad, and they wave their hands about increased
durability with snapshots but never quantify what that means.

Hard drives are unreliable, and they certainly don't fail independently of one
another - but their failures are far more independent than EBS volumes' are.

With physical drives and n-parity RAID you drastically reduce the rate of
data loss. Although failures are often correlated, it's quite unlikely to have
permanent failure of 3 drives out of a pool of 7 within 24 hours. It happens,
but it is very rare.

With EBS, your 7 volumes might very well be on the same underlying RAID array.
So you have no greater durability by building software RAID on top of that. If
anything, it potentially decreases durability.

You could use snapshots to S3, but is that really a good solution? It seems
that deploying onto EBS at any meaningful scale is a recipe for guaranteed
data loss. RAID on physical disks isn't a great solution either, and there is
no substitute for backups - but at least you can build a 9-disk RAIDZ3 array
that will experience pool failure so rarely that you can more sensibly worry
about things like memory and data-bus corruption.

~~~
saurik
The increased durability from snapshots is actually quite simple, and they
explain it in various places: if one of the drives in Amazon's RAID fails,
they need to bring up a new disk to replace it in the array. When they bring
up new disks they can typically do this almost instantaneously, because they
really just dynamically page-fault the drive in from your latest snapshot.
However, all dirty data since the last snapshot has to be copied from the
other drive(s). That is a window of time during which your array is exposed to
data loss from unrecoverable read errors. The less dirty data you have, the
smaller the window.

------
prakash
We (Cedexis) presented our findings on how EC2's East, West, EU & APAC zones
compare (pdf):
[http://www.cloudconnectevent.com/2011/presentations/free/76-...](http://www.cloudconnectevent.com/2011/presentations/free/76-marty-kagan.pdf)

If you would like to know more, please send me an email: prakash [at]
cedexis.com

~~~
jerf
You should post that to HN, if you haven't already. Possibly wrap a blog post
around it.

------
gruseom
Anybody care to comment on using EC2 with local (what Amazon calls ephemeral)
storage and backup to S3? It seems to me the advantages are that it's cheaper
and you avoid the performance and reliability problems of EBS. The
disadvantages?

~~~
krakensden
All of your EC2 instances can disappear without warning and everything on the
local storage is now gone forever.

~~~
gruseom
That's the "backup to S3" part.

~~~
krakensden
That's a fair point, but I don't think it holds up very well. What are the
semantics? Do you block until everything is fully backed up to S3? Are you
continuously taking database snapshots and forwarding them to S3? What happens
if the backups fall further and further behind production?

What do you tell the hordes of angry redditors when the last thirty minutes of
carefully (or angrily) composed comments vanish?
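
The least-bad answer I can picture is continuous WAL archiving, shipping each
segment to S3 as it fills - something like this hypothetical script wired into
PostgreSQL's archive_command (bucket name made up, boto assumed, retries
omitted) - and even then, falling behind is exactly the failure mode I'm
worried about:

    # Hypothetical WAL-shipping hook; PostgreSQL would invoke it as
    #   archive_command = 'python archive_wal.py %p %f'
    import sys
    import boto
    from boto.s3.key import Key

    def archive_wal(path, filename):
        conn = boto.connect_s3()
        bucket = conn.get_bucket("my-wal-archive")  # hypothetical bucket
        key = Key(bucket, "wal/%s" % filename)
        key.set_contents_from_filename(path)

    if __name__ == "__main__":
        archive_wal(sys.argv[1], sys.argv[2])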

------
Kilimanjaro
Lesson for startups: start in the cloud, grow your business, build your own
cloud.

Never trust critical parts of your business to others.

~~~
mkramlich
Good advice, but I'd argue one tweak makes it even better: start outside the
cloud (say, on some Linux VMs from Linode or wherever), and only move to a
cloud provider if you get enough real customer/visitor demand to warrant
easy/virtual scaling. Needing a cloud/elastic hosting provider is a bit of a
Maserati Problem. If you get to the point where you have to build and manage
your own data centers (like Google, Amazon, or Orbitz), you have a
Fleet-of-Maseratis Problem.

------
floodfx
I'll probably be downvoted for this, but it seems to me the root cause of
this problem is Reddit's architectural decision to remain in a single
availability zone. If it hadn't been EBS, some other issue related to the
single AZ could have brought the site down. Blaming EBS, particularly if you
knew it to be a potential weakness in your architecture, seems like a
deflection of responsibility.

~~~
snorkel
Perhaps reddit could've mitigated some downtime with cross-zone redundancy,
but the underlying frustration is that Amazon does not provide a well-behaved
storage solution, which is a very critical infrastructure component for most
web services.

------
bmurphy
Having run a 200 GB, millions-of-transactions-per-day Postgres cluster on
Amazon's EC2 cloud for two years now, I can attest that EBS performance and
reliability SUCK. They are our SINGLE biggest problem with EC2.

200 GB really isn't all that big a database. It shouldn't have to be this
hard.

------
steve918
At this very moment our team is restoring Postgres volumes, because the EBS
volumes backing both our primary and secondary failed simultaneously.

~~~
obfuscate
Were both in the same availability zone?

------
absconditus
How is it that Amazon.com is so reliable if there are so many problems with
their "cloud" products? Do they not use the same software to run their site?

~~~
gpapilion
If you understand the limitations of the various products, you can build a
VERY reliable service. Reddit's assumption of a single datacenter and a single
technology to store its data was an engineering failure. They essentially
didn't have a disaster recovery plan in place.

~~~
snorkel
I'm sure reddit's engineers are as capable as any of producing a seamless
disaster recovery plan, but the most common obstacle to implementing one is
cost. Most web services accept the occasional risk of downtime in one data
center rather than incur the cost of being in two data centers at all times.

~~~
mkramlich
Yep. And there's that whole asymptotic cost/complexity curve: as you chase
more 9's of perfection, your cost and complexity rise out of proportion to the
value you're getting. At the end of the day, no matter how much we might like
Reddit, it's still just a website with social discussion forums and link
sharing, full of non-essential chatter and pictures of kitties. (Again, I love
Reddit, don't get me wrong, but it's far from a Mission Critical resource for
any business or person's life.) So achieving perfect reliability &
performance is probably not worth the cost/pain.

------
jread
I was at the Cloud Connect conference last week. In a session on cloud
performance, Adrian Cockcroft (Netflix's cloud architect) said they do not use
EBS because of performance and reliability issues. They had some bad
experiences with EBS early on and, because of this, decided to stick with
ephemeral storage almost exclusively.

The guys from Reddit also spoke about their use of EC2. Apparently they are
running entirely on m1 instances, which suffer from notoriously poor EBS
performance relative to m2 and cc1/cg1 instances.

------
danielrhodes
What's the failure rate of EBS versus direct access to physical disks? My
guess is that at scale it's probably similar.

Although you would hope that the storage components of AWS's cloud were highly
reliable, I think the main benefit is not single-instance reliability but
being able to recover faster because replacement hardware is quickly
available.

~~~
bmurphy
I don't have solid numbers, just some experience. Ephemeral drives outright
fail more often than EBS volumes; however, EBS volumes suffer performance
degradation significantly more often than ephemeral drives. EBS volume
performance is _HIGHLY_ variable, at all times of day, no matter what load you
throw at it. Ephemeral drives are very consistent most of the time.

Both types of drives CAN and DO fail, so RAID10, failover, and replication are
a must.

------
ck2
I firmly believe "the cloud" is a fad, unless for some reason you own and
operate all the hardware yourself (i.e. you're Google).

Like other technical fads, everyone will probably come back, sooner or later,
to servers they can reach out and touch when needed.

~~~
jedsmith
The cloud significantly lowers the capital expenditure needed to get into an
Internet-enabled business, which cultivates the very startup ecosystem that Y
Combinator exists to leverage and support. Those teenagers who started the
Facebook Pokemon game would never have had the resources to build a scalable
solution with hardware that they own. (That is, unless Y Combinator paid a lot
more money as part of participating. They might also be a bad example, because
I remember that one of them had a successful sale... it's true for a lot of
other ideas, so work with the example.) The cloud lowers the barrier to entry
enough that good ideas can be explored and built, with very little financial
risk to those getting into it.

This was the role of shared hosting in the past. Several years ago, everybody
realized that having root is better. Now, instead of colocating two servers
and negotiating transit and dealing with remote hands, you can spin up two
Linodes for $40 and have enough power to build anything. Critical mass? Add
three more. You're not waiting for a shipment of servers to the datacenter to
handle a sudden load from a positive mention on HN.

Saying that the cloud is a fad and we should all own our gear does two
things: (a) it increases humanity's carbon footprint, since most organizations
never utilize hardware to its full potential, and (b) it guarantees that only
those with significant capital to buy a fleet, a cage, and power will ever
compete in the Internet space, which is where we were many years ago. It is
very arguable that the cloud is progress, and that everybody sitting on the
sidelines calling it a "fad" is scared by it.

Jeremy Edberg of Reddit had a good comment later in that thread, to someone
who paralleled the cloud to electricity generation:

[http://www.reddit.com/r/blog/comments/g66f0/why_reddit_was_d...](http://www.reddit.com/r/blog/comments/g66f0/why_reddit_was_down_for_6_of_the_last_24_hours/c1l7e77?context=1)

What sucks is, my remarks really depend on what you define "cloud" as, which
-- partially thanks to Microsoft television commercials -- is currently up in
the air.

~~~
api
The cloud's real advantage is the ability to build out fast; it is not cost.
It is cheaper to build it yourself and run it yourself _if you know exactly
what you need, and have the time to do so._ If you don't, the cloud is
cheaper.

So you're right that the cloud is great for startups. It is not so great for
established stuff.

~~~
jedsmith
Oh, it's certainly about cost. Once you account for paying for your own
transit, power, cage space, and remote hands, cloud providers can be
significantly cheaper than owning the hardware. You also lose the
administrative overhead of performing drive swaps when your units degrade --
it's just computing capacity that _exists_, with a minimum of hassle to you. I
think if you add up all of the variables, cloud can (and does) come out more
cost-effective.

I think private clouds are fantastic for established stuff, and many companies
use public clouds to their benefit as well.

I added this in an edit after you replied, but _cloud_ is a term that is
difficult to nail to the wall: my explanation to people that I like to run
with is that the cloud is a way to think about your architecture.

Rather than have a DNS box, two Web servers, a DB box, and so on, plus
another server for every development environment, virtualizing the hardware
makes a lot of sense. You get a lot more traction out of each U, and with a
large number of off-the-shelf utilities you can automate the hell out of it.
Need a clean test environment to try an installation of your software? There
are ways to accomplish that in minutes, then dispose of it and reuse the
space. _That_, to me, is a cloud: virtualization, with automation on top of
it. That's what Linode has been doing for nearly eight years now, so it's
arguable that Linode pioneered the cloud space. In 2003, it was just called
VPS hosting.

Integrating a public cloud and a private cloud makes a lot of sense, and a
lot of established big-iron shops are taking this approach. Big players are
realizing that the cloud makes a lot of sense, as we see with HP's
announcement that they intend to enter the cloud market.

------
obfuscate
For a data set in the mere tens to hundreds of GB (in MongoDB, if anyone's
curious), is there any reason I shouldn't conclude from this that I should use
instance storage only (with multi-AZ replication and backups to S3, both of
which I would be doing in any case)? Moderately slower recovery in the rare
event of an instance failure seems better than the constant possibility of
incurable, performance-killing degradation.

(Edit: I hadn't considered the possibility of somehow killing all my instances
through human error. Ouch. That probably warrants one slave on EBS per AZ.)

------
Zak
I recently had an EBS volume lose data for no apparent reason. I'm not a
heavy EC2 user at all - I was just doing some memory/cpu-heavy stuff that
wouldn't fit into RAM on my laptop, using EBS as a temporary store so I could
transfer data with a cheap micro instance and only spin up the big expensive
instances once everything was in place. I ended up re-downloading files on an
m2.4xlarge because the files I had just downloaded to the EBS volume vanished.

~~~
saurik
Are you certain the data left the filesystem buffer and actually got
acknowledged by EBS?

~~~
Zak
No; I'm very much a beginner when it comes to EC2. I unmounted the filesystem,
detached the volume, then shut down the instance.

------
cpg
This seems like too much of a coincidence.

We released a dropbox-like product to sync, and the back-end is on EBS.
Yesterday we twice saw a device fill up toward 7 GB and get slower and slower
as it approached that limit. We did not have any instrumentation/monitoring in
place, and we immediately suspected it was something on our end.

We (wrongly?) assumed reliability and (decent) performance from AWS.

------
j_s
Being totally new to AWS, I have to ask: why does everyone skip right past
ZFS?

[http://blogs.sun.com/marchamilton/entry/a_brilliant_argument...](http://blogs.sun.com/marchamilton/entry/a_brilliant_argument_for_zfs)
"Cloud Storage Will Be Limited By Drive Reliability, Bandwidth ... The key
feature of ZFS enabling data integrity is the 256-bit checksum that protects
your data."

~~~
jodrellblank
ZFS will ensure that what was written to disk comes back to memory
consistently, or with errors spotted. It won't ensure that the right thing was
written to disk, or that the database IDs that were written leave your
database relationships in a consistent state, etc.

ZFS will do nothing about this - " _More recently we also discovered that
these disks will also frequently report that a disk transaction has been
committed to hardware but are flat-out lying._ " - other than tell you the
data you want isn't there to be read, like any filesystem would.

------
PaulHoule
I love the idea behind EBS - a SAN makes life so much easier - but I too find
that EBS glitches are the largest cause of unreliability in AWS.

I'm not immediately planning to move out of AWS, but the trouble with EBS has
certainly got me thinking about other options and has made me much less
inclined to make an increased commitment to AWS.

~~~
drivebyacct2
EBS is not a SAN which is largely the point being made in these comments and
in the other HN article on reddit's post mortem.

------
natch
Isn't EBS intended for stuff like a Hadoop job's temporary data, used during
processing?

This kind of complaint reminds me of people who buy a product that does A very
well, and then trash it in reviews for not doing B. It was never advertised as
doing B, but you'd never know that from the complaining.

------
amitraman1
We used Amazon and got bad performance in the beginning too. It is bad when
you pull files out of S3. By bad I mean the latency is high.

We tried GoGrid and they lost or crashed our server instance.

I've personally used Rackspace, so far so good, but I've only been doing
development on it.

------
jclouds-fan
Why is reddit relying on only one cloud provider? AWS can and should do
better, but service providers of reddit's size should definitely be using
multi-vendor setups.

~~~
jvanenk
It probably has something to do with the team being very small. Sure, they
turn a lot of traffic, but there's only so much you can do with a group of
their size on what I imagine is still a limited budget.

~~~
rworth
Sounds like a case similar to safety systems at a nuclear plant: not pressing
until it is REALLY PRESSING! It's the usual dilemma: investing time/money on
something that most likely won't be needed versus adding that cool feature all
the users will immediately see the benefit of. In a competitive environment,
it isn't difficult to understand how they ended up on one vendor.

~~~
davidw
If a nuclear plant has problems, it can kill a lot of people, and wreck the
lives of many others.

If reddit has problems, I suppose the worst that can happen is a cloud of
toxic and poorly thought out comments is released on the internet.

So the tradeoffs they've made, in saving some money, are probably sensible.

~~~
DennisP
> If reddit has problems, I suppose the worst that can happen is a cloud of
> toxic and poorly thought out comments is released on the internet.

Actually, that's what happens when reddit is working :)

~~~
davidw
Well, depending on prevailing conditions, they might be more widely dispersed
rather than contained within the special "echo chamber" that reddit has built
for that purpose.

------
yuhong
On the comment itself, I have this:
<http://news.ycombinator.com/item?id=2339715>

------
lurker17
EMR is a mess too. The Amazon-blessed Pig is almost a year and 2 major
releases behind, and the official EMR documentation seems to describe a
version of EMR that doesn't even exist.

"Elastic" is AWS's claim to fame, but I am not seeing it.

Trying to resize an EMR cluster (which is half the point of having an EMR
cluster instead of buying our own hardware) generates the cryptic error
"Error: Cannot add instance groups to a master only job flow" that is not
documented anywhere.

(Why would Amazon even implement a "master only job flow", which serves no
purpose at all?)

~~~
adpowers
The master-only job flow is designed to let users play around with the
instance and explore things without having to pay for a full cluster. A
single-node cluster is configured quite differently from a multi-node cluster,
which is why you can't expand a single-node cluster. If you had started with a
two-node cluster, you would have been able to expand it.

Also, if you want a newer Pig, you should complain about it vocally on the EMR
forum. That is the best way to get them to listen to you.

------
Andys
The AWS business model is to sell shared hosting on commodity hardware. Cloud
is a cool buzzword, but it is still shared hardware. Cheap, commodity hardware
is the magic that lets you scale up so big and so fast at a highly accessible
price.

But you're still sharing the same hardware as everyone else, and it's still
just commodity hardware.

~~~
smhinsey
For what it's worth, it's not entirely accurate to say that you are always
using shared hardware on AWS, at least for your servers. It depends on how you
set up your environment.

