I know it's not exactly in vogue these days to tout the merits of bare hardware, but... after all the VPS hubbub over the last couple of years, the best progression for your website still seems to be:
1. No traction? Just put it anywhere, 'cause frankly, it doesn't matter. Cheapest reputable VPS possible. Let's say, Linode.
2. Scaling out, high concurrency and rapid growth? DEDICATED hardware from a QUALITY service provider--use Rackspace, SoftLayer et al. Have them rack the servers for you and you'll still get ~3 hour turnarounds on new server orders. That's plenty fast for most kinds of growth. No inventory to deal with, and with deployment automation you're really not doing much "sysadmin-y" work or requiring full-timers who know which Cisco switch to buy.
3. Technology megacorp, top-100 site? Staff up on hardcore net admin and sysadmin types, colocate first, and eventually, take control of/design the entire datacenter.
I simply don't understand why so many of these high-traffic services continue to rely on VPSes for phase 2 instead of managed or unmanaged dedicated hosting. The price/concurrent user is competitive or cheaper for bare metal. Most critically, it's insanely hard to predictably scale out database systems with high write loads when you have unpredictable virtualized (or even networked) I/O performance on your nodes.
reddit actually is a top 100 site, but we don't have nearly the need to host our own datacenter or co-locate. If we do make a move, it will be to #2. I don't want to hire people to be hands on -- I'd rather outsource that and let someone else pay to have spare capacity lying around.
Fair enough; 100 was somewhat arbitrary, but there is some threshold where, once you're at Google scale, cutting down on power costs or achieving absolute minimum latencies between datacenters has a meaningful impact on the bottom line. Draw a line somewhere, and yeah, Reddit is probably on the other side of it.
Generally with you, except for the Rackspace recommendation; I've been burned and pissed off by Rackspace too many times to ever use or recommend them again.
I tend to try to find at least one good local/regional datacenter for a good portion of a server stack. There are huge benefits (IMO) in being able to drive to your servers and have a face-to-face meeting if there are issues, and/or take your toys and go someplace else if there is a massive outage. If you're in Green Bay, options might be limited, but in any semi-major metropolitan area there are usually enough datacenter options that you have multiple choices.
Another thought. If availability is your goal, with the trend toward 'operations as code', I think a small development team can build a system on top of AWS that can automatically respond to arbitrary node/resource/data-center failure. Netflix seems to do this to an extent, with their Chaos Monkey.
That being said, there are situations where you may truly need single-node performance that isn't available on AWS.
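The Chaos Monkey idea boils down to deliberately killing random production instances so that failure handling gets exercised constantly instead of only during real outages. A minimal sketch of that loop -- the names and the `terminate` hook are invented for illustration, not Netflix's actual implementation:

```python
import random

def chaos_monkey(instances, terminate, rng=random):
    """Pick one running instance at random and terminate it.

    `instances` is a list of instance IDs; `terminate` is whatever hook
    actually kills a node (an API call in production, a stub in tests).
    Returns the victim's ID, or None if the fleet is empty.
    """
    if not instances:
        return None
    victim = rng.choice(instances)
    terminate(victim)
    return victim

# Example run against a fake fleet, with a list standing in for the
# real terminate call so we can see what got killed.
killed = []
fleet = ["i-0a", "i-0b", "i-0c"]
victim = chaos_monkey(fleet, killed.append, rng=random.Random(0))
```

The point isn't the ten lines of code; it's that if this runs every day, "automatically respond to arbitrary node failure" stops being an aspiration and becomes a continuously tested property.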
Assuming for a minute that Amazon deserves as much blame as ketralnis is heaping on here, why would the Reddit guys be so reluctant to point this out? Professionalism? Kindheartedness? Even professionalism and mutual respect have limits.
The community loves both the site and the admins, but there are limits to the patience of users, and those limits are being tested by these outages. I would think the Reddit guys would be happy to have a scapegoat to direct the community's rage towards.
Blaming a third party lacks class. The Reddit guys made the decision to rely heavily on EBS, and it came back to bite them. They show a lot of character by taking responsibility for an outage they had very little control over.
Reddit's no longer a cash-strapped startup with limited resources and options. It's ultimately their choice to stick with Amazon. If Amazon has been giving them the runaround for more than a whole year, maybe it would've been a smart decision to move to something else.
I'm not sure that he really cares if he's helping the situation or not, seeing as he's no longer employed there.
I know first-hand that when a team is working on things, and it's very publicly going wrong, and it's due to things that are out of your control, it's beyond frustrating to have everyone think that it IS your fault due to the public stance.
I'm guessing that the Reddit team that is dealing with this is more than a bit pissed that they can't be more public with what the real reasons are.
Personally, I couldn't care less about the "professionalism" or political BS; I'd rather know the real reasons for the problems so that I could be better informed and not run into the same issues.
On the other hand:
"Blaming a third party lacks class. The Reddit guys made the decision to rely heavily on EBS, and it came back to bite them. They show a lot of character by taking responsibility for an outage they had very little control over."
What do you think is the best solution to that one?
Indeed. You should always have a second and third hammer any time you swing, with an automatic failover to complete the job in case the first hammer explodes on contact or spontaneously ceases to exist.
I was :) I just hear the "a poor craftsman blames his tools" line often, but I'd bet you'd hear Michelangelo singing a different tune if his chisel hammer broke off at the handle and landed on his toe.
Amazon claims: "Each storage volume is automatically replicated within the same Availability Zone. This prevents data loss due to failure of any single hardware component".
They make it sound like they are already providing RAID or something similar; however, the fact that things like this happen to Reddit, who have built their own RAID on top of Amazon's already-replicated volumes, shows that reliability is not a good reason to go with AWS.
EBS isn't really RAIDed, it's virtualized block storage with replicas. The issue Reddit experienced wasn't drive failure, though, it was network degradation. The solution is to deploy redundant replicas in different availability zones (and/or regions, if you can). Reddit unfortunately wasn't built for that.
This isn't really any different from an on-premises application. An availability zone by definition implies "shared network hardware". Using multiple is what you do when you want redundancy.
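A quick way to see the point: redundancy only counts if the replicas don't all sit behind the same shared network hardware. A toy placement check (the data model here is invented for illustration):

```python
def zone_redundant(replicas, min_zones=2):
    """Return True if the replica set spans at least `min_zones` distinct
    availability zones -- i.e. losing any one zone's shared network
    hardware cannot take out every copy at once.

    `replicas` is a list of (name, availability_zone) pairs.
    """
    zones = {az for _name, az in replicas}
    return len(zones) >= min_zones

# Two replicas in us-east-1a share a failure domain; one replica each
# in 1a and 1b survives a single-zone network degradation.
same_zone = [("db-1", "us-east-1a"), ("db-2", "us-east-1a")]
cross_zone = [("db-1", "us-east-1a"), ("db-2", "us-east-1b")]
```

This is exactly the check that an EBS-only, single-AZ deployment fails: the replication is real, but it all lives inside one failure domain.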
Disk access was the single reason we couldn't go with AWS offerings. The speed/reliability of EBS just isn't where we need it to be for our database servers. I don't blame Amazon for this--it's just a drawback of their choice to go fully shared-tenant and virtual.
Idle speculation is a bit useless, but it really makes you wonder why so many of their team have left recently, doesn't it? I wonder if the stress of keeping up with the increased load (and subsequent downtime) became too much for some of them.
Ah yes, the telltale sign of a large bureaucracy: people spending $150 of one kind of money to avoid spending $100 of a different kind of money (in this case, payroll budget versus tech-infrastructure budget).
Well, I'm a SysAdmin/hardware guy myself, so obviously, I'd buy my own hardware and run it. Once a month I'd spin up my off-site backups (that would be on ec2) to make sure I had somewhere to go if the shit hit the fan at my co-lo.
Of course, I've spent my life dealing with infrastructure, so running a rack or two (or ten) of servers is going to be a lot cheaper/easier for me than it would be for someone without that experience. The economics for people unwilling to gain that experience will depend entirely on scale; e.g. do they have enough servers that the lower running cost of owning hardware would pay for someone to manage that sort of thing. (And yes, hiring other people has overhead in and of itself.)
Generally, as much as possible, I avoid building complexity. I find that having a single point of failure with a backup that can be manually brought into place (such as an asynchronously replicated database) is quite often more reliable than fancy home-made SAN solutions. In general, you need to be /very careful/ of complex redundant systems. In fact, I approach it somewhat like crypto: as much as possible, I don't build it myself; I use well-used open-source tools with well-known failure modes.
The other thing to think of is failure domains. Sure, think about single points of failure, but more importantly, think about what goes down if that single point fails. For instance, in my current setup, each Xen host is a single point of failure... but one going down won't take down anything else. I've seen other people design similar systems with improvised SAN setups, thinking "oh, if one node dies, I'll boot the guests on another!" The problem is that if that SAN goes down, everyone is toast, while if one of my hosts goes down, we're talking maybe 1/40th of my customers who are out of action.
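That reasoning is easy to make concrete: compare the blast radius of losing one independent host against losing a shared component that every host depends on. The numbers below are made up to match the 1/40th example:

```python
def blast_radius(failed_component, guests):
    """Fraction of guests that go down when `failed_component` fails.

    `guests` maps guest name -> set of components that guest depends on.
    """
    hit = [g for g, deps in guests.items() if failed_component in deps]
    return len(hit) / len(guests)

# 40 guests spread across 40 independent local-disk hosts...
local = {f"guest{i}": {f"host{i}"} for i in range(40)}
# ...versus 40 guests on 40 hosts that all depend on one shared SAN.
san = {f"guest{i}": {f"host{i}", "san"} for i in range(40)}
```

The improvised-SAN design has the same number of single points of failure on paper; the difference is that one of them has a blast radius of 100%.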
I've seen drbd setups where a guest mirrored itself locally and to a second server... It sounded like a great idea, but the system turned out to be less reliable than my dumb local storage setup, as weirdness in drbd and how drbd dealt with disk and network issues would cause lockups that were much more frequent than the hardware failures that would take down whole nodes in my local disk setup.
In fact, I have a very strong suspicion, born of hard experience and smoking pagers, of SANs that cost less than a mid-sized Bay Area condo. And I'm pretty cheap, so that means local storage for me. There has been a lot of activity in that field lately, so I'm very carefully exploring it again, but I certainly wouldn't count on some homemade SAN being more reliable than local disk.
(to be fair, my experience lies with "expensive SANs, used expensive SANs, and homemade SANs." and my only good experience was with the first of those. I don't have a lot of experience with low-cost commercial SANs.)
Moreover, I'm pretty suspicious of disk-over-the-network schemes, even when the expensive "real" SANs are involved. NFS is the only scheme I really trust; it's older than I am. It has problems, but we know about all of them. Overall, I think NFS handles network blips much better than any of the block-device-over-the-network schemes I've used. And you do see network blips. The network simply isn't as reliable as your SATA cable, and the block subsystem isn't designed to deal with devices that are temporarily unavailable.
I've seen a lot of 'clever' redundant setups built by people who are much smarter than I am... quite often, their setup ends up becoming less reliable than my "dumb" systems.
> I find that having a single point of failure with a backup
> that can be manually brought in to place (such as an
> asynchronously replicated database) is quite often more
> reliable than fancy home-made SAN solutions.
As someone who has invested heavily in implementing HA systems and then seen them be the root of increased downtime, I have come to the same conclusion. A simple fail-over system that will result in short downtime & require limited manual intervention is in many cases the best route.
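A sketch of the "simple failover" shape being endorsed here: one primary, one asynchronous replica, and a promote step that an operator triggers deliberately rather than an HA daemon triggering automatically. All the names and the data model are invented for illustration; in Postgres terms, this role is played by async streaming replication plus a manual promote.

```python
class SimplePair:
    """One primary plus one async replica; failover is a deliberate,
    manual promote rather than an automatic HA decision."""

    def __init__(self):
        self.primary = "db1"
        self.replica = "db2"
        self.log = []          # writes committed on the primary
        self.replica_log = []  # what the async replica has caught up to

    def write(self, record):
        self.log.append(record)

    def replicate(self):
        # Async: the replica catches up when it gets around to it.
        self.replica_log = list(self.log)

    def promote_replica(self):
        # The manual step an operator runs after confirming the primary
        # is really dead. Writes not yet replicated are lost -- the
        # accepted cost of keeping the system this simple.
        lost = len(self.log) - len(self.replica_log)
        self.primary, self.replica = self.replica, self.primary
        self.log = list(self.replica_log)
        return lost

pair = SimplePair()
pair.write("a"); pair.write("b")
pair.replicate()
pair.write("c")            # primary dies before this replicates
lost = pair.promote_replica()
```

The failure mode is explicit and bounded (you lose the replication lag, here one write), which is exactly the trade being argued for: a short, well-understood outage instead of an HA system with surprising ones.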
I always love seeing a good technical post-mortem of what went wrong and how it could be fixed in the future...
I'm currently working on building a backend service that has to scale massively as well, and it has been a fun challenge trying to understand exactly where things can go wrong and how wrong they can go...
So, I assume instances reddit uses have instance storage root volumes instead of EBS root vols. I've always assumed the persistence of ebs AMIs was a plus without a downside. Why would you opt for instance-storage AMIs instead of ebs root volume EC2 AMIs?
Given that EBS booting only became an option in December 2009, I would not be surprised if Reddit had not migrated their instances to that boot/storage method. They acknowledged that in the last two years they hadn't even had time to move one of their databases from a single EBS volume to striped EBS volumes.
Q1. I still don't get the use case for db storage on ephemeral storage.
Q2. If EBS is the problem why are you migrating to S3 backed EBS boot vols? The problem with this is still the time in between snapshots even though it will be shortened.
It will only be a matter of time before S3's disks and hardware start dying like EBS's... en masse.
I talked with Ketralnis several years ago and know how many VMs you were running back then. Pretty sure you're not too far off from that count even today (even if it's 2x).
You can still virtualize on a good set of dedicated hardware to emulate your current 'network environment' and get up and running in the near term _asap_. Obviously you'd build out of that VM environment (with your load) as the days go by. Seriously look into a parallel switchover, though.
If EBS is in fact a huge issue as has been shown, you really may need to start migrating off unless you want dedicated employees monitoring system health on AWS. Eventually if problems continue that is what will happen, with no time left to even develop automation... And why automate on a pile of instability?
Don't forget that every VM you add at this high failure rate increases soft management costs and will eventually eat into your development time...
I don't work for Rackspace (I think they're quite expensive), but you guys might benefit from this level of care to focus on the real issues.
Thanks for the follow up jedberg, I was just guessing based on what has been publicly stated in the various blog posts over the last year or two. I used the same process for my own s3 -> ebs boot volume migration. That took a few weeks and I didn't have that many instances to migrate in the first place. Given the large number of instances reddit uses and the surprisingly small staff, one would reasonably expect that the migration would not be done.
Duly noted. I started with this talk and have been using it as a guide to scaling edge cases with python as well as AWS. I thought RAID 10 was overkill before I started digging into the postgres/EBS mess, but now it seems almost routine enough that Amazon should offer it as a configuration option.
I had two machines running in east-1 last night and one of them went down around the same time reddit did. The other one made it through the night O.K.
EBS problems do seem to be the biggest reliability problem in EC2 right now. The most common symptom is that a machine goes to 100% CPU use and 'locks up'. Stopping the instance and restarting usually solves the problem.
The events also appear to be clustered in time. I've had instances go for a month with no problems, then it happens 6 times in the next 24 hours.
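The stop/restart remedy is easy to automate once you can recognize the symptom. A hedged sketch of the detection side only -- the thresholds are arbitrary, and the actual stop/start would be an EC2 API call rather than anything shown here:

```python
def looks_locked_up(cpu_samples, threshold=99.0, min_samples=5):
    """Heuristic for the '100% CPU and wedged' EBS symptom: every one of
    the last `min_samples` CPU readings (percentages, oldest first) is
    pegged at or above `threshold`. Returns True if a stop/restart of
    the instance is warranted."""
    recent = cpu_samples[-min_samples:]
    return len(recent) >= min_samples and all(c >= threshold for c in recent)

# A busy-but-healthy box spikes and recovers; a wedged one stays pinned.
healthy = [12.0, 40.5, 33.1, 97.0, 88.2]
wedged = [35.0, 100.0, 100.0, 99.8, 100.0, 100.0, 100.0]
```

Requiring several consecutive pegged samples is what keeps a legitimate CPU-bound burst from triggering a restart.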
My sites are small, but one of them runs VERY big batch jobs periodically that take up a lot of RAM and CPU. Being able to rent a very powerful machine for a short time to get the batch job done without messing up the site is a big plus.
This is why you don't outsource your bread and butter, people!
If you want to outsource who makes your lunch, fine, but if your whole business is requests in, data out, you do not put the responsibility of storing your data in someone else's hands.
I get it, Amazon EBS is cheap. But at the end of the day you've got to make sure it's your fingers on the pulse of those servers, not someone else whose priorities and vigilance may not always line up with yours.
It's all a continuum. You could also build your own servers, or design specialized boards with dedicated processors optimized to your application. Everyone is going to choose a point on the continuum and each will have tradeoffs.
You're still outsourcing if you go with a managed dedicated hosting service, or even if you buy hardware and colocate it. Even if you owned the datacenter and the entire backbone, you're still banking on everyone else not fucking up their end of the connection.
Yeah, but even I don't use consumer hard drives in production. (Honestly, I don't know 100% that the 'enterprise' drives are that much better, but I'd guess they lie less... I switched because consumer drives tend to hang RAIDs when they fail, while 'enterprise' stuff fails cleanly.)
Enterprise hardware lies about fsync, because even if you lose power, there is a little battery on the RAID controller that's enough to flush the cache to disk or to keep it alive for hours until the machine is powered up again. When the battery goes bad, the write cache is disabled automatically.
On bigger hardware, like SAN storage arrays, the (redundant) batteries keep the whole thing running for a while after the loss of power.
You could consider that the whole battery backup etc. means that it isn't lying about fsync. It says that it's been permanently written, and then it makes sure that it has. It might not have been burnt to spinning metal, but the system as a whole will ensure that it's permanent.
"enterprise" disk is just that; it's just disk. I've not seen a disk with an onboard battery. You can put a raid controller in front of your disks, with or without battery backed cache. (most raid cards have an optional battery module.) But, even many gold plated 'corporate' systems I've seen omit the battery, as it's usually not cheap.
My personal opinion is that raid controllers are of little value without battery backed cache.
I just use MD in front of 'enterprise SATA' drives. As far as I can tell, I can't get a RAID card that's better than MD without doubling my total cost for storage, and if I were to double my storage cost anyway, simply buying twice as many spindles would get me better bang for my buck, I think, than a hardware RAID card that cost that much.
More recently we also discovered that these disks [EBS volumes experiencing performance degradation] will also frequently report that a disk transaction has been committed to hardware but are flat-out lying.
You can actually configure Postgres not to do an fsync() after each transaction; instead the data gets written to disk, which on Linux means written to the disk caches in RAM, and Linux flushes it to disk when it does the rest of its disk writes (the default interval is anywhere from 5 to 30 seconds on different versions of Linux).
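For reference, the knobs being described are real Postgres and Linux settings, though the values shown are only illustrative. Note that `fsync = off` risks corrupting the database on a crash; `synchronous_commit = off` is the gentler variant that only risks losing the last few transactions:

```ini
# postgresql.conf -- trading durability for write throughput
fsync = off                 # never force WAL to stable storage (dangerous)
synchronous_commit = off    # safer: commit returns before the WAL flush

# sysctl -- how long Linux lets dirty pages sit in RAM before flushing
vm.dirty_expire_centisecs = 3000      # pages older than 30s get written out
vm.dirty_writeback_centisecs = 500    # flusher thread wakes every 5s
```

With settings like these, "committed" means "in the page cache", which is exactly why an fsync-lying storage layer underneath is so hard to reason about.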
So where is the education in jumping from "we have no proof to determine what happened" to an outrageous claim about EBS?
It's very common to turn off fsync on a database for performance reasons.
It's far less common to have a network block device driver (which tend to be designed with intermittent outages in mind) lie about fsync.