1. No traction? Just put it anywhere, 'cause frankly, it doesn't matter. Cheapest reputable VPS possible. Let's say, Linode.
2. Scaling out, high concurrency and rapid growth? DEDICATED hardware from a QUALITY service provider--use Rackspace, SoftLayer, et al. Have them rack the servers for you and you'll still get ~3 hour turnarounds on new server orders. That's plenty fast for most kinds of growth. No inventory to deal with, and with deployment automation you're really not doing much "sysadmin-y" work or requiring full-timers who know what Cisco switch to buy.
3. Technology megacorp, top-100 site? Staff up on hardcore net admin and sysadmin types, colocate first, and eventually, take control of/design the entire datacenter.
I simply don't understand why so many of these high-traffic services continue to rely on VPSes for phase 2 instead of managed or unmanaged dedicated hosting. The price/concurrent user is competitive or cheaper for bare metal. Most critically, it's insanely hard to predictably scale out database systems with high write loads when you have unpredictable virtualized (or even networked) I/O performance on your nodes.
But it looks like that attitude has finally caught up with them, especially since they are down to just 3 technical staff (from 5 last week), and two of them are brand new.
I tend to try to find at least 1 good local/regional datacenter for a good portion of a server stack. There are huge benefits (IMO) in being able to drive to your servers and have a face-to-face meeting if there are issues, and/or take your toys and go someplace else if there is a massive outage. If you're in Green Bay, options might be limited, but in any semi-major metropolitan area there are usually enough datacenter options that you have multiple choices.
That being said, there are situations where you may truly need single-node performance that isn't available on AWS.
But yes, you are right. The goal is to have a system that scales itself. Not an easy task for sure.
A former employee is not quite as nice to Amazon.
This wasn't the case a few months ago. Good god, I'm getting old; now I know how the poor people at plastic.com feel.
For two of them, it wasn't the case a few days ago.
The community loves both the site and the admins, but there are limits to the patience of users, and those limits are being tested by these outages. I would think the Reddit guys would be happy to have a scapegoat to direct the community's rage towards.
"Hi Mr. CIO of Amazon. You realize we had a 6 hour outage on one of the largest sites in the US because of you. We don't want to badmouth your service but we can. Care to pay more attention now?"
And it is pretty unprofessional. Ketralnis thinks he's defending his buddies but he isn't really helping the situation.
I know first-hand that when a team is working on things, and it's very publicly going wrong, and it's due to things that are out of your control, it's beyond frustrating to have everyone think that it IS your fault due to the public stance.
I'm guessing that the Reddit team that is dealing with this is more than a bit pissed that they can't be more public with what the real reasons are.
Personally, I couldn't care less about the "professionalism" or political BS; I'd rather know the real reasons for the problems so that I could be better informed and not run into the same issues.
I'd rather see more candor and less PR.
Reddit isn't vital to anyone's well being, but it is a service that hundreds of thousands (?) use pretty regularly, so it certainly isn't trivial either.
How is he hurting the situation by calling out a service for what it really is?
They make it sound like they are already providing RAID or something similar; however, the fact that things like this happen to Reddit, who have built their own RAID on top of Amazon's already replicated volumes, shows that reliability is not a good reason to go with AWS.
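For anyone curious what "building your own RAID on top of EBS" looks like in practice, here's a rough sketch with boto. The volume sizes, device names, and instance ID below are made-up placeholders, not reddit's actual setup:

    # Sketch: attach several EBS volumes to one instance so they can be
    # striped/mirrored with mdadm (boto 2; all IDs and sizes are placeholders).
    import boto.ec2

    conn = boto.ec2.connect_to_region('us-east-1')
    instance_id = 'i-12345678'
    devices = ['/dev/sdf', '/dev/sdg', '/dev/sdh', '/dev/sdi']

    for device in devices:
        vol = conn.create_volume(size=100, zone='us-east-1a')  # 100 GB each
        # (real code would wait for vol.status == 'available' before attaching)
        conn.attach_volume(vol.id, instance_id, device)

    # On the instance itself you would then build the array, e.g.:
    #   mdadm --create /dev/md0 --level=10 --raid-devices=4 \
    #         /dev/sdf /dev/sdg /dev/sdh /dev/sdi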
This isn't really any different from an on-premise application. An availability zone by definition implies "shared network hardware". Using multiple is what you do when you want redundancy.
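To be concrete, spreading instances across zones is cheap to do with boto (the AMI, instance type, and zone names below are just placeholders):

    # Sketch: launch one instance in each of two availability zones so that a
    # zone-level failure only takes out part of the pool (boto 2).
    import boto.ec2

    conn = boto.ec2.connect_to_region('us-east-1')
    for zone in ['us-east-1a', 'us-east-1b']:
        conn.run_instances('ami-12345678',
                           instance_type='m1.large',
                           placement=zone)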
If you had to run a site like Reddit, what would you do?
At this point in its life, reddit should have 6-12 programmers/system administrators + a few support staff, compared with the 3 they have at the moment.
That way they won't be agonizing over the choice between devoting their resources to keeping reddit running in the short term, or to moving reddit away from EC2 for long term stability.
How would you pay for that?
Of course, I've spent my life dealing with infrastructure, so running a rack or two (or ten) of servers is going to be a lot cheaper/easier for me than it would be for someone without that experience. The economics for people unwilling to gain that experience will depend entirely on scale; e.g., do they have enough servers that the lower running cost of owning hardware would pay for someone to manage that sort of thing? (And yes, hiring other people has overhead in and of itself.)
Generally, as much as possible, I avoid building complexity. I find that having a single point of failure with a backup that can be manually brought into place (such as an asynchronously replicated database) is quite often more reliable than fancy home-made SAN solutions. In general, you need to be /very careful/ of complex redundant systems. In fact, I approach it somewhat like crypto: as much as possible, I don't build it myself; I use well-used open-source tools with well-known failure modes.
The other thing to think of is failure domains. Sure, think about single points of failure, but more importantly, think about what goes down if that single point fails. For instance, in my current setup, each Xen host is a single point of failure... but one going down won't take down anything else. I've seen other people design similar systems with improvised SAN setups, thinking "oh, if one node dies, I'll boot the guests on another!" The problem is that if that SAN goes down, everyone is toast, while if one of my hosts goes down, we're talking maybe 1/40th of my customers who are out of action.
I've seen DRBD setups where a guest mirrored itself locally and to a second server... It sounded like a great idea, but the system turned out to be less reliable than my dumb local storage setup, as weirdness in DRBD and how it dealt with disk and network issues would cause lockups that were much more frequent than the hardware failures that would take down whole nodes in my local disk setup.
In fact, I have a very strong suspicion, born of hard experience and smoking pagers, of SANs that cost less than mid-sized Bay Area condos. And I'm pretty cheap, so that means local storage for me. There has been a lot of activity in that field lately, so I'm very carefully exploring it again, but I certainly wouldn't count on some homemade SAN being more reliable than local disk.
(to be fair, my experience lies with "expensive SANs, used expensive SANs, and homemade SANs." and my only good experience was with the first of those. I don't have a lot of experience with low-cost commercial SANs.)
Moreover, I'm pretty suspicious of disk-over-the-network schemes, even when the expensive "real" SANs are involved. NFS is the only scheme I really trust; it's older than I am. It has problems, but we know about all of them. Overall, I think NFS handles network blips much better than any of the block-device-over-the-network schemes I've used. And you do see network blips. The network simply isn't as reliable as your SATA cable, and the block subsystem isn't designed to deal with devices that are temporarily unavailable.
I've seen a lot of 'clever' redundant setups built by people who are much smarter than I am... quite often, their setup ends up becoming less reliable than my "dumb" systems.
> I find that having a single point of failure with a backup
> that can be manually brought into place (such as an
> asynchronously replicated database) is quite often more
> reliable than fancy home-made SAN solutions.
Murphy's law on St. Patrick's Day. Doesn't get any better than that.
I'm currently working on building a backend service that has to scale massively as well, and it has been a fun challenge trying to understand exactly where things can go wrong and how wrong they can go...
I know the community can be demanding, but that just seems stressful.
The comment about moving to local storage was interesting. Isn't the local storage on EC2 instances extremely limited (like 10-20 GB)?
Those instances actually show the same problems, but they aren't too bad, because once you boot them, you don't need the root vol that much (that's what the instance storage is for).
Q1. I still don't get the use case for db storage on ephemeral storage.
Q2. If EBS is the problem why are you migrating to S3 backed EBS boot vols? The problem with this is still the time in between snapshots even though it will be shortened.
It will only be a matter of time before S3 disks and hardware start dying like EBS...en masse
I talked with Ketralnis several years ago and know how many VMs you were running back then. Pretty sure you're not too far off from that count even today (even if 2x).
You can still virtualize on a good set of dedicated hardware to emulate your current 'network environment' to get you up and running in the near term _asap_. Obviously you'd build out of that VM environment (with your load) as the days go by. Seriously look into a parallel switch-over, though.
If EBS is in fact a huge issue as has been shown, you really may need to start migrating off unless you want dedicated employees monitoring system health on AWS. Eventually if problems continue that is what will happen, with no time left to even develop automation... And why automate on a pile of instability?
Don't forget that adding more VMs with this high a failure rate increases soft management costs and will eventually eat into your development time...
I don't work for Rackspace (I think they're quite expensive), but you guys might benefit from this level of care to focus on the real issues.
We're still not sure either, so we're investigating to see if it makes sense. One possible option will be to have the master on ephemeral disk with a hot backup on EBS so there is no data loss.
Another option is to use ephemeral for the master and all but one slave, so we get hot backups without a slowdown.
Still need to look into it more.
The one where we are doing ephemeral right now is Cassandra, with continuous snapshots to EBS. Everything in there can be recalculated, and with an RF of 3, if we lose one node we can run a repair.
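For illustration, a periodic backup job for that kind of setup could be as dumb as the sketch below. The paths, interval, and layout are assumptions on my part, not what reddit actually runs:

    # Sketch: snapshot Cassandra's on-disk data (which lives on ephemeral/instance
    # storage) and rsync it onto an EBS-backed mount. Paths are placeholders.
    import subprocess
    import time

    DATA_DIR = '/var/lib/cassandra/data'        # ephemeral storage
    EBS_BACKUP_DIR = '/mnt/ebs/cassandra-snapshots'
    INTERVAL = 15 * 60                          # every 15 minutes

    while True:
        subprocess.check_call(['nodetool', 'snapshot'])  # flush + hard-link a snapshot
        subprocess.check_call(['rsync', '-a', DATA_DIR + '/', EBS_BACKUP_DIR + '/'])
        time.sleep(INTERVAL)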
> Q2. If EBS is the problem why are you migrating to S3 backed EBS boot vols? The problem with this is still the time in between snapshots even though it will be shortened.
They are just easier to use. The root volume is rarely accessed after it is booted, so the EBS slowdowns aren't really a problem in that case.
> Some Comments: It will only be a matter of time before S3 disks and hardware start dying like EBS...en masse
I don't think so. It is a totally different product built by a totally different team with a different philosophy. S3 was built for durability above all else.
In response to the rest of your comments, you are absolutely right, there are other options. We will certainly be investigating them.
#1 tip: Don't use threading. Python threading + EC2 will not work well. Instead rely on the OS doing the task switching and run multiple copies.
If you want more info, I did a talk at Pycon about this and other things: redd.it/b5jyy
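For anyone who hasn't done it before, "run multiple copies" can be as simple as the multiprocessing sketch below (the worker function is obviously a stand-in for real work):

    # Sketch: one worker process per core instead of Python threads; the OS does
    # the scheduling. The worker body is a placeholder.
    import multiprocessing

    def worker(n):
        # stand-in for real per-request work
        return sum(i * i for i in range(n))

    if __name__ == '__main__':
        pool = multiprocessing.Pool(processes=multiprocessing.cpu_count())
        print(pool.map(worker, [10 ** 6] * 8))
        pool.close()
        pool.join()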
EBS problems do seem to be the biggest reliability problem in EC2 right now. The most common symptom is that a machine goes to 100% CPU use and 'locks up'. Stopping the instance and restarting usually solves the problem.
The events also appear to be clustered in time. I've had instances go for a month with no problems, then it happens 6 times in the next 24 hours.
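If it helps anyone, the stop/start dance is easy enough to script with boto (the instance ID and region here are placeholders); stopping and starting an EBS-backed instance lands it on different hardware:

    # Sketch: bounce an EBS-backed instance that has locked up (boto 2).
    import time
    import boto.ec2

    conn = boto.ec2.connect_to_region('us-east-1')
    instance_id = 'i-12345678'                   # placeholder

    conn.stop_instances(instance_ids=[instance_id])
    while True:
        instance = conn.get_all_instances(instance_ids=[instance_id])[0].instances[0]
        if instance.state == 'stopped':
            break
        time.sleep(10)
    conn.start_instances(instance_ids=[instance_id])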
My sites are small, but one of them runs VERY big batch jobs periodically that take up a lot of RAM and CPU. Being able to rent a very powerful machine for a short time to get the batch job done without messing up the site is a big plus.
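In case it's useful, renting the big box around the job is only a few lines with boto (the AMI and instance type below are placeholders):

    # Sketch: spin up a high-memory instance for a batch job, then terminate it
    # so you only pay for the hours used (boto 2; IDs are placeholders).
    import boto.ec2

    conn = boto.ec2.connect_to_region('us-east-1')
    reservation = conn.run_instances('ami-12345678', instance_type='m2.4xlarge')
    instance = reservation.instances[0]
    try:
        pass  # wait for it to boot, ssh in, run the batch job...
    finally:
        conn.terminate_instances(instance_ids=[instance.id])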
If you want to outsource who makes your lunch, fine, but if your whole business is requests in, data out, you do not put the responsibility of storing your data in someone else's hands.
I get it, Amazon EBS is cheap. But at the end of the day you've got to make sure it's your fingers on the pulse of those servers, not someone else whose priorities and vigilance may not always line up with yours.
(also the cloud is dumb)
If you read between the lines, this says that EBS lies about the result of fsync(), which is horrifying.
It's generally possible to fsync, then cut the power before the data is physically on the disk.
There's also a Postgres contrib module called pg_test_fsync that tests various fsync modes on your hard drive, and if the results come back faster than the hardware could physically manage, you can strongly suspect that the disk is lying.
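A crude version of the same check can be done by hand: time a loop of fsync'd writes and compare against what the spindle can physically do. The path, write size, and the ~120/sec figure for a 7200 RPM disk below are my own rough assumptions:

    # Sketch: if a plain 7200 RPM disk "achieves" far more than ~120 synchronous
    # commits per second, something in the path is acking writes before they hit
    # the platter. Path and iteration count are placeholders.
    import os
    import time

    PATH = '/var/lib/pgsql/fsync_test.tmp'
    ITERATIONS = 1000

    fd = os.open(PATH, os.O_WRONLY | os.O_CREAT, 0o600)
    start = time.time()
    for _ in range(ITERATIONS):
        os.write(fd, b'x' * 8192)
        os.fsync(fd)
    elapsed = time.time() - start
    os.close(fd)
    os.unlink(PATH)

    print('%.0f fsyncs/sec' % (ITERATIONS / elapsed))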
On bigger hardware, like SAN storage arrays, the (redundant) batteries keep the whole thing running for a while after the loss of power.
My personal opinion is that raid controllers are of little value without battery backed cache.
I just use MD in front of 'enterprise SATA' drives: without doubling my total cost for storage, I can't get a RAID card that is better than MD, as far as I can tell. And really, if I were to double my storage cost, simply buying twice as many spindles would get me better bang for my buck, I think, than a hardware RAID card that cost that much.
Of course that completely shafts your performance, but hopefully you were expecting that.
More recently we also discovered that these disks [EBS volumes experiencing performance degradation] will also frequently report that a disk transaction has been committed to hardware but are flat-out lying.
There's not enough information to know if that's even possible or not, or how it might be possible, but as far as I know there aren't any known postgres issues that would cause that.
I would be extremely reluctant to store data on a network that sometimes didn't actually store the data without raising an error anywhere.
Jump to conclusions much?
Do you know if reddit had fsync enabled in the first place?
It's very common to turn off fsync on a database for performance reasons.
It's far less common to have a network block device driver (which tend to be designed with intermittent outages in mind) lie about fsync.