I agree. I read this post and was shocked at the amount of planning, process, man-hours, hardware failures and other problems that come with owning your own machines. I've worked at places with ~500 EC2 machines in a dozen autoscaling groups across 3 AZs, with many ELBs, databases, SQS queues and other AWS infrastructure, and never had to deal with anything like this when upgrading.
Upgrading hardware in EC2 is as simple as changing a launch configuration and updating an auto-scaling group - maybe an hour of my time to update configs, verify and deploy. Updating something like a database or caching servers is more work for sure, but with zero time needed to get to the DC, unpack, rack and configure servers, you do save time with 'the cloud'.
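For the common case it's literally a couple of API calls. A minimal boto3 sketch - the names, AMI id, and instance type here are all made up for illustration:

    import boto3

    autoscaling = boto3.client('autoscaling')

    # New launch configuration pointing at the bigger instance type.
    autoscaling.create_launch_configuration(
        LaunchConfigurationName='web-lc-v2',   # hypothetical name
        ImageId='ami-12345678',                # hypothetical AMI
        InstanceType='c4.2xlarge',
    )

    # Point the auto-scaling group at it; everything launched from
    # now on comes up on the new instance type, and old instances
    # can be cycled out gradually.
    autoscaling.update_auto_scaling_group(
        AutoScalingGroupName='web-asg',        # hypothetical name
        LaunchConfigurationName='web-lc-v2',
    )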
I get that you do pay more for EC2 instances, especially if you keep hardware for 4 years. But AWS prices drop every year or two, along with (generally) faster instance types, so your overall costs do drop.
How many ops employees would you need for a fleet of 500 servers in a datacenter? We managed it all with 4 people with AWS.
The Stack Exchange philosophy is that because they can buy truly mega hardware (each one of those two blade chassis they bought has 72 cores and 1.4TB of RAM, remember!), they don't need those 500 servers to start with. Plus the hardware is an asset and you get to depreciate it.
Everywhere I've ever worked we've had the "big spreadsheet" of projected cloud costs, projected ops costs, and hardware costs. In general the "scale horizontally" philosophy will favor the cloud while the "scale vertical" philosophy still seems to favor owned hardware in local datacenters. Which is superior is a crazy, long-standing debate with no clear answer.
Can you elaborate? I thought the answer to that question was to scale up if you can, because it's much simpler and therefore cheaper. Similar to how you don't give up ACID unless the scale you're working at no longer permits it.
There's never really a "one size fits all" answer, which is why it's a long-running debate and depends heavily on the product.
Scaling horizontally can let you use smaller, cheaper hardware on average and burst to higher capacity more easily if you need to, at the expense of a lot of complexity. It also (done right, which is rare) tends to gain you a greater degree of fault tolerance, since hardware instances become rapidly-replaceable commodities.
Most web apps have spiky but relatively predictable load. For example, a typical enterprise SaaS startup gets more traffic during work hours than on weekends. For these companies, the complexity of a horizontally scaled architecture can be offset by the savings: instead of buying really big machines sized for peak load, you run extra capacity during the peaks and scale back to a couple of small instances for periods of below-average load.
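On EC2 that scale-back can be as simple as a pair of scheduled scaling actions on the group. A hedged boto3 sketch - group name, schedule, and capacities are all invented for the example:

    import boto3

    autoscaling = boto3.client('autoscaling')

    # Drop the (hypothetical) group to 2 instances every evening...
    autoscaling.put_scheduled_update_group_action(
        AutoScalingGroupName='web-asg',
        ScheduledActionName='scale-down-nightly',
        Recurrence='0 20 * * MON-FRI',   # cron syntax, UTC
        DesiredCapacity=2,
    )

    # ...and bring it back up before the workday starts.
    autoscaling.put_scheduled_update_group_action(
        AutoScalingGroupName='web-asg',
        ScheduledActionName='scale-up-morning',
        Recurrence='0 6 * * MON-FRI',
        DesiredCapacity=20,
    )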
That's (ostensibly) why AWS exists in the first place: Amazon had to buy a lot of peak capacity for Black Friday and Christmas and found it going unused the rest of the year. They never meant to sell their excess capacity, but they realized the tools that they built to dynamically scale their infrastructure were valuable to others.
Plus, a lot of work is offline data analytics, ETL, and so on. It's very cost effective to scale these workloads horizontally on-demand - spin up extra workers to run your reporting each hour/night and keep costs down the rest of the time when you don't need the capacity.
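A hedged boto3 sketch of that pattern (the AMI, instance type, and counts are placeholders):

    import boto3

    ec2 = boto3.client('ec2')

    # Launch a batch of workers for the nightly run...
    resp = ec2.run_instances(
        ImageId='ami-12345678',    # hypothetical worker AMI
        InstanceType='c4.xlarge',
        MinCount=10,
        MaxCount=10,
    )
    worker_ids = [i['InstanceId'] for i in resp['Instances']]

    # ...run the jobs, then throw the capacity away until tomorrow.
    ec2.terminate_instances(InstanceIds=worker_ids)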
On the flip side, companies like Stack Exchange and Basecamp have high, relatively stable traffic worldwide. For companies like this it makes more sense to scale vertically - if they were in the cloud, they would never scale down or shut down their instances anyway.
Personally, I agree that horizontal scalability is oversold and most people can, indeed, scale up instead of out. However, plenty of smart people disagree with me and have valid reasons to scale horizontally, too.
> a typical enterprise SaaS startup gets more traffic during work hours than on weekends.
You still need to budget for what you can get renting dedicated hardware vs. renting virtual machines. E.g. a dual Xeon X5670 machine w/ 96GB RAM and 4x480GB SSD can be had for $249 per month (just something random I found for demo purposes). Even with a one-year reserved instance on EC2, that kind of money gets you an m3.2xlarge, and that's only 30GB RAM and 2x80GB SSD.
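Back-of-the-envelope, using those (randomly found) numbers, the per-GB-of-RAM gap is stark:

    # Prices from above; both roughly $249/month
    dedicated_per_gb = 249.0 / 96   # ~$2.59 per GB of RAM per month
    ec2_per_gb       = 249.0 / 30   # ~$8.30 per GB of RAM per month

    print(dedicated_per_gb, ec2_per_gb)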
It might be worth it to rent this sort of iron instead of spinning EC2 instances up and down, especially if you can reasonably buy a large enough machine to cut out a lot of the headaches that come with distributed computing. The right tool for the job.
> How many ops employees would you need for a fleet of 500 servers in a datacenter? We managed it all with 4 people with AWS.
This could be a false dichotomy. Just because a service built on AWS uses that many servers doesn't mean a more monolithic system would need as many.
We had talks with one of our competitors (before they were a competitor). We mentioned that we ran our infrastructure on 4 large VM hosts (with a light density of 3-4 VMs per host). They were shocked: they were running over a hundred EC2 instances, along with the relevant satellite services. They literally could not believe that we could provide a comparable service without relying on something like AWS.
It's amazing what can be done with the right knowledge. In our case, one of my coworkers and I maintain our infrastructure in something like 4-6 hours per week total (mostly patching, reviewing logs, etc.). We both have previous networking, hardware, and software experience. When we do major upgrades (about every 2 years), it takes one of us about a week to source the hardware, get it loaded into the rack, and turned on. Then we migrate guests over and we're done. This doesn't even get into the cost savings of running on our own hardware vs. AWS pricing.
We run about a hundred servers (soon to be lots more) with a part time staff of 3 (as in, we all do dev work most of the time). It used to be mostly me for ages, but we got big enough that I got promoted out of most of the day-to-day stuff.
All our own hardware, and having just had a reboot on Softlayer's schedule to fix the Xen issue for a separate project we're running on their gear - being able to schedule your own maintenance windows is so much nicer. We spend less time dealing with problems on our own hardware than we do dealing with cloud providers having issues.
At my previous company, I built a dedicated hardware system that consistently delivered a sub-second response time at a cost of about a third of what it would cost to host on AWS.
After I left, the CTO who replaced me migrated it to Rackspace's cloud offering. I don't know the costs involved, but the site now averages an 8s load time.
You can't really beat the raw performance for the price of dedicated equipment.
Where I'm currently working, thanks to good automation procedures there are only 3 people managing 4 datacenters on 3 continents, with over 1000 virtual machines and a couple of hundred physical servers: 1 Linux sysadmin, 1 network guy and 1 VMware guy. None of them works full time on maintaining the infrastructure - just patching/upgrading/installing new systems, and that's 1-2 days a week at most.
I have just finished planning the migration of two datacenters, and that process takes about 2 months including shipping/networking/configuring/installing machines.
I really don't understand the obsession with getting rid of ops/hardware guys and relying on Amazon/Google/CoolCloudProvider to handle everything.
I worked for a large SCADA company. We collected large amounts of data from thousands of large industrial installations.
One day we got a new VP, a "cloud expert" from a well-known firm. He moved (nearly) all of our infrastructure to AWS, after producing untold amounts of spreadsheets/PowerPoints expressing how much cheaper/better/faster it was going to be.
Long story short, it was 4x as expensive as running it in house. By the time they went back to our own infrastructure, most of the internal sysops (including "The Glue" guy) had moved on, and much of the old internal hardware had been re-purposed or was gone. It was a fiasco that they still have not fully recovered from.
I would be very careful about characterizing AWS as the solution to every large-scale computing infrastructure problem.
Conversely, I have had excellent experiences with AWS in my current job, although we still have a rather large HPC cluster internally which would never make sense to move to AWS.
> How many ops employees would you need for a fleet of 500 servers in a datacenter? We managed it all with 4 people with AWS.
I'd say our goal is to keep growing and serving more content without _needing_ 500 servers in a data center. We are doing pretty well at that so far. We'll see what happens in the future.
The cloud is awesome if you have zero interest whatsoever in hardware. It's not without trade-offs (nor is the other direction), and too many for us - but if it works for you then great I say.
We obviously feel very differently, and are just doing what works best for us.
most of the public cloud cheer-leading is just people rationalizing to themselves what an awesome decision they made deploying on aws onto a billion tiny instances or whatever. and for lots of folks, it probably is pretty awesome.
however, i've seen very, very few people compare actual before/after $ figures on hn. when it comes time to show your cards everyone gets cold feet, either because they 1. don't have any idea because they aren't the ones paying for it, 2. don't have a baseline for comparison and are just paying whatever amazon asks, or 3. found it ended up costing 2-4x as much on amazon
when your bill is $5k/month 2-4x isn't that big of a deal. when it's $100k+ a month, it becomes a really big deal.
Actually, occasionally you do need to think about it. See the various times AWS has emailed customers about unplanned reboots that had to happen because of hardware issues/patching/etc.
Cloud is great, but there are still plenty of reasons to run your own datacenter. Yes, for many startups it might not make sense, but at a certain size of company/application it can easily make sense.
AWS just announced a few days ago that their latest Xen patch will be deployed through a live update to their hypervisor kernel, and that going forward they expect patches like this to be rolled out live.
The real upside of AWS is that they have relentlessly pursued and killed off reasons for you to care about things like this. They've eliminated points of failure in their infrastructure and given operators a wealth of tools to ensure their apps stay up through any update or event (AZ-affine ELBs and autoscaling groups, single-IP ELBs, continuous improvements to EBS and S3, etc.) Given the scale of their infrastructure in us-east-1, it's now also highly unlikely that any customer will manage to overload it on their own.
I can't resist reminding you that 1 command from a sysadmin routing traffic to the wrong network was the cause of the last major outage there :)
They are getting much better, as all providers are. They're still just not a fit for many people because of performance requirements that are either impossible or too costly to meet on that type of infrastructure.
I've always said this: the cloud isn't a good fit for us; do what works for you.
> I can't resist reminding you that 1 command from a sysadmin routing traffic to the wrong network was the cause of the last major outage there :)
If you're going to bring that up, I can't resist reminding you of that time you had poor sysadmins running up and down stairs with buckets of fuel to keep your servers running[1].
> I've always said this: the cloud isn't a good fit for us; do what works for you.
There are costs. Frankly, it appears Stack Exchange prefers to lay those on its people rather than its purse.
As Kyle says, we were helping our sister company Fog Creek keep their servers online (as well as other people in that facility, like Squarespace) because we cared. Our traffic was not being served from that data center; in fact we shut down most of our servers there to conserve generator fuel. Our traffic was flowing just fine from Oregon - a decision Kyle and I made the night before, when we concluded they would probably shut down power to lower Manhattan in preparation for flooding.
When your neighbor's house is on fire you don't argue over the price of the hose. You help. Our remote people that couldn't come help in person also helped them replicate their entire network in AWS as a backup plan.
I don't usually post pissed-off comments, but you're dead wrong here and, intentionally or not, demeaning a good company and good people who, because they cared, came to help in a time of emergency. I take it you weren't in New York during Sandy; it looked like a post-apocalyptic war zone afterwards.
It's sad that you can't handle a snarky comment (right or wrong) that was given in response to your own snarky comment. If you can't handle being poked and it pisses you off, don't poke others.
Also, you seem particularly offended that your altruism is being maligned, when you're also complaining that the GP was unaware that the action was altruistic in the first place (???).
That's hardly an argument. Netflix, consuming about a third of the nation's peak downstream traffic, is a great counterexample.
In terms of scale, GoGuardian (the company I co-founded) has passed Stack Exchange. Articles like this make me so happy to be on AWS. Delegating this work to AWS allows us to focus on the product instead of the hardware that it runs on.