Cloud computing offers you the great advantage of being able to instantly scale your application, replicate your data and basically just grow according to your business volume, all without significant investment, delivery time, setup time, people time or maintenance. But it's expensive in the long run.
And this is OKAY, this is GREAT.
Once you're big enough, you know what your load is now and what your load will likely be, and you know exactly what you need now and (approximately) what you're going to need in the near future, setting up your own datacenter is way, way more effective.
Amazon does not get free electricity, free servers and/or free people time.
Of course, you're paying that, and you're also paying Amazon's profits.
This is absolutely fine, as long as their service fits you.
But when you grow enough, put simply, your needs change. It's just that.
- machine latency varies more, you can't control it
- network latency varies more
- storage latency varies more (S3, Redshift, etc.)
- machine outages are more frequent
It's a lot harder to engineer cloud scale software to scale robustly and not degrade in latency when running on a large amount of nodes. For example, see 
Most open-source cloud software does not come with these algorithms included, and it is not trivial to retrofit this kind of logic. Just being smart about load balancing won't cut it when at any given moment one of your nodes may become 10x slower than the others, even though your code is sound and does not itself slow down like that.
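To put a rough number on why this bites harder as the node count grows, here is a minimal sketch. It assumes each node is independently in a "slow" state with some small probability at any given moment (the 1% below is an illustrative assumption, not a measured AWS figure) and that a request has to wait for every node it fans out to.

```python
def p_request_hits_slow_node(p_slow: float, fanout: int) -> float:
    """Probability that at least one of `fanout` nodes is slow.

    Assumes nodes go slow independently, each with probability p_slow.
    """
    return 1.0 - (1.0 - p_slow) ** fanout


if __name__ == "__main__":
    # p_slow = 0.01 is a made-up illustrative value.
    for fanout in (1, 10, 100):
        p = p_request_hits_slow_node(0.01, fanout)
        print(f"fanout={fanout:>3}: {p:.1%} of requests wait on a slow node")
```

Under these assumptions, a request touching a single node is rarely affected, while one that touches 100 nodes waits on a slow node roughly 63% of the time. That is why load balancing alone does not help once requests fan out.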
In fact, what you lose in AWS convenience and "free" maintenance, you gain in simpler RPC/messaging/fault tolerance/storage infrastructure that can sometimes accommodate an order of magnitude more traffic or users on a machine than if deployed in AWS.
I wish it were true, but plenty of companies are gripped by cloud fever. I've seen quite a few going down the route of charging into the cloud not because they've run the numbers and found it stacks up, but because they want to be in the cloud, and Amazon have some great marketing people.
But I note that I have a unique case that isn't covered by the cloud "architecture" (crawling the web and indexing it). Explaining that it is "ok" if I am in the 10% not covered by your 90% solution. But sales folks never like to hear that.
A ton of places don't have a good team, and the cost numbers end up being a lot closer.
A great team will destroy Cloud margins. An Average-Bad team will fall in line, plus you don't have to run the team anymore.
But our operations costs are so much lower than renting that we could replace all our hardware every year and still break even.
(yeah, we could get SoftLayer to refresh our hardware every year as well, but we don't need it refreshed that fast, and at the end we still own the hardware)
If you have a great team, I firmly believe hosting yourself is far, far, far less expensive.
If you have a terrible team, then cloud (hosting) is less expensive. Even if it was exactly the same cost, you're gaining by not having to have a staff to run it, and the costs of managing them, etc etc.
Most places don't have great teams. Insert random corporation here likely has a team that is a mess for whatever reasons happen in large companies.
In that case, the Cloud makes a ton of sense for them. They've already screwed up their own organization in some way, and this is a large reset button on the whole thing.
That's worth a ton in itself.
You can get some cheap dedicated hosting (ex: ) for a fraction of the price.
It's so cheap compared to AWS that you can order a few spare ones and still come out cheaper than your one beefy AWS instance?
The only way it doesn't make sense is if you need to scale up and down very fast?
60 euro/month: Quad-Core Haswell, 32 GB (non-ECC) RAM, 240 GB SSD @ https://www.hetzner.de
You can definitely save cost by subscribing to reserved instances, but the downside is you have to put down money upfront, which is very hard for many small players out there.
But watch out if you run data pipeline jobs - sometimes your so-called big data is really not that big. A daily report over a few GBs doesn't need to run on c3.xlarge instances; it can do just fine on a 24/7 m3.large instance. There was an article on HN a while ago about how one could run a custom report with shell commands on commodity hardware and get 100x the performance compared to running on EMR. You can also consider running most of your jobs on premise. The network bandwidth in/out is probably going to be cheaper than running all of your jobs on EMR. Direct Connect is a great choice to boost connectivity stability and security. Go for it.
Cloud is great for HA, because on Amazon you are encouraged to build multi-AZ and even multi-region. S3 is absolutely the de facto standard today IMO for object storage. It's cheap and reliable. The learning curve for using Amazon properly (or just about any cloud provider) is really steep. You can either end up like running Black Friday sales, or running like Netflix with monkey enjoying tea.
Running on cloud is no different than running on-premise, just you have to start all over again, because now you have to re-consider network, security, monitoring, and practice.
This isn't required anymore.
AWS introduced new Reserved Instances options a few months ago, including a "no upfront" option which still gets you a ~40% discount over on-demand prices.
I'm glad I left.
Back then it seemed most companies did it not because they had done the numbers, but because a few big names had done it, and so the others did it to apparently piggyback on the stock market's attention.
On a side note, I feel like EC2 is simultaneously the best and worst service on AWS.
Are you? Amazon is not known for posting profits, and is quite possibly operating AWS at a loss.
They have promised to show financial results for AWS itself with the quarterly report being released next month, so we'll see.
Update: in FY2014, they sold $89BN ($70.1BN in products, $18.9BN in services), and their cost of sales was $62.8BN. I am pretty sure that their margins are razor-thin in retail, but the service side had higher margins.
They built it so they could run Amazon.com off the infrastructure, but then decided they could make money leasing their under-subscribed portions.
If they are not making a cut-and-dried profit from AWS (I'd guess they are; they are one of the most expensive cloud providers and many other providers turn a healthy profit), then they are at least dramatically offsetting their own Amazon.com infrastructure costs, which are mandatory anyway.
That doesn't mean the AWS part is in itself unprofitable. I'd wager it is profitable, given how the business is somewhat mature and they have a huge chunk of that market.
1. Your operational overhead will increase _a lot_. Be ready to hire on a lot of ops staff if you expect them to do anything but put out fires. And as you grow you'll need experts, people like network engineers.
2. Any weirdness you experienced with AWS infrastructure will be replaced with weirdness in your own environment, except now you're on the line to troubleshoot and fix it yourself.
3. Operation staff will immediately start guarding the food bowl as resources become finite. Server provision waits start to seem like breadlines. Power is consolidated with Those Whom You Must Ask.
4. Your cost will decrease, sometimes significantly so.
5. You'll have more hardware flexibility to run your app just the way you want to (Stack Overflow's mega databases come to mind).
In the end I think this type of transition is for stable companies that don't mind or even prefer strong divisions of labor (coders who code, sysadmins who sysadmin, testers who test), but it's not for startups or companies that hope to move with any kind of strong velocity.
1) The operational overhead is the same as Amazon. Actually I think it's less because everything is so predictable. Every machine we get has identical performance to any other. We still don't have to care about the exact same list of things that EC2 provided but we also don't have to care about weird cloud issues.
2) See above answer, but also Softlayer are happy to provide actual support for any issues in a timely manner. In general everything has been so much more reliable that you actually have to think less about High Availability technologies that make your stack more complex. In the years I've been using them we have only had a few hard disk failures, which were replaced in around an hour, and we just failed over to a replicated slave manually for around 2-3 minutes of downtime.
3) Resources are no more finite than EC2. The only difference is that the provisioning time is 1 hour rather than 1 minute. That has still been perfectly fine to respond to unexpected load events.
5) Also great.
People often compare Cloud Hosting with Colo. What they should compare cloud hosting with is renting servers. You get 95% of the benefits of cloud with none of the drawbacks.
Why? What are you talking about? You are renting servers, you are not colo'ing them. Networking them is not your problem. Your responsibility still starts from a root prompt, just there's no VM layer between that and the physical server.
And for cloud: we purchased video conversion as a service, that one ran in the cloud. I can see how that makes sense.
This is the worst part about moving out of the cloud, especially since cloud computing has moved a lot of ops and deployment responsibility to developers.
If you have lots and lots of money and a high margin business, do yourself a favor and go with Amazon (much less hassle with contract management and low level challenges).
If you need to scale month to month and are growing 50% per month, go with Amazon.
If you are very small and can live with 10 instances, go with Amazon.
If CAPEX doesn't help you and for whatever reasons you need to spend OPEX, go with Amazon.
If you need many (types of) machines for failover but which otherwise mostly idle, go with Amazon.
Otherwise it's always cheaper to buy or rent hardware. Amazon is very expensive (TCO).
If you base your decision on hype, you're screwed.
* Amazon stands for Cloud Provider, personally I'm choosing Digital Ocean with Mesos/Docker.
* Except S3 which is a no brainer to use.
Then rent more servers.
> If you are very small and can live with 10 instances,
Then rent a few servers.
I maintain that there are extremely few cases for a typical website to use the cloud. To handle peaks, it is both simpler and cheaper to keep enough capacity just idling around than spinning up and down Amazon instances. The cloud is almost always a useless hype. It can be different if you can architect to use the various services Amazon provides.
Also from my experience, with 10 instances the money you save with custom servers is negligible, and contract and SLA management, multi datacenter etc. is easier with a cloud provider than renting servers. At least where I've rented servers in the past.
Sure, the physical hardware takes 1 hour rather than 1 minute to spin up, but the process is otherwise entirely identical.
> To handle peaks, it is both simpler and cheaper to keep enough capacity just idling around than spinning up and down Amazon instances.
At a growing website, you have no idea how much "enough" is. Why try to estimate caps when you don't have to?
Why do I get the feeling it was kind of a cop-out to just pack up and move without finding the root cause? I've seen it plenty of times: the "best solution" is to just find a different hosting provider.
In my experience, I've never found an issue with an application on AWS that wasn't caused by either a misunderstanding of what was being offered (e.g. not provisioning enough PIOPS for database volumes), or simply issues with the application code.
You haven't been using Amazon long enough then.
Amazon is great for proof of concept. No upfront costs, extremely scalable, etc. Unfortunately, it's expensive compared to physical hardware once you get to scale, and you may never solve underlying performance issues due to it being a shared tenant environment, even if you're a Netflix-sized customer.
Solving multi-tenancy issues is hard, but not impossible. I think it's a lot easier with live migration. If a box is giving you problems, just move the load to a new box while maintaining the same IP addressing.
With respect to cost, yes, AWS gets expensive at scale, but if you're at scale your servers are generally not your major cost center (it's usually payroll and licensing).
Of course, that design discipline is great wherever you are running....
Perfect should never be the enemy of the good.
Only in a world where resources are infinite does this work.
It's possible they get special treatment if they are big enough (nobody else's jobs on their physical machines ... or something like that).
And no, I don't think they're getting any real special treatment from Amazon, since every single talk from a Netflix engineer points out the cloud-specific issues they're solving in their infrastructure software.
It's definitely kicking the can down the road (eventually you have to build such that failing infrastructure is transparent to your eng team), but I still think it was the right decision at the time. YMMV obviously. :)
Of course, this only works if whatever you use it for allows for this.
However, in most cases not using EC2 as a stateless throwaway computing resource is simply a matter of bad infrastructure design.
So you're paying Amazon to do the same work you would do otherwise - only you're subject to their rules and procedures and Amazon being a profitable business needs to mark their services up.
> So you're paying Amazon to do the same work you would do otherwise - only you're subject to their rules and procedures and Amazon being a profitable business needs to mark their services up.
But I thought that they were paying Softlayer to do that stuff instead of Amazon. They're not doing it themselves - and yet it's still cheaper!
(Update) Re: failures - with ~50 servers we see a hardware issue (a dead disk in a RAID or an ECC memory failure) about once a month or so. None of those failures has caused a single outage (RAID and ECC RAM FTW) so far.
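As a back-of-the-envelope reading of those figures, the sketch below converts "about one hardware issue a month across ~50 servers" into a per-server yearly rate. It only restates the numbers quoted above; it assumes incidents are spread evenly over the fleet, which real hardware may not respect.

```python
def incidents_per_server_year(servers: int, incidents_per_month: float) -> float:
    """Average hardware incidents per server per year for a fleet."""
    return incidents_per_month * 12 / servers


# ~50 servers and ~1 incident per month are the figures from the comment above.
rate = incidents_per_server_year(servers=50, incidents_per_month=1)
print(f"{rate:.2f} incidents per server-year")
```

That works out to roughly one incident per server every four years, none of which became an outage thanks to RAID and ECC.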
Plus, you're still engineering your applications to be just as fault-tolerant as if they were running in cloud, right? The only difference is you are not paying the virtualization overhead tax. A single server dying should leave you in a no less redundant state than a single VM dying. They should also be nearly as easily deployable.
This is based off my personal experience in datacenters with 5,000-10,000 installed servers. Anything other than a PSU or HDD failure is exceedingly rare.
In fact over 4 years we have only had 3 hard drives fail and no other hardware failures.
I know at least when I've bought 20-30 servers at a time, I was able to get a lower cost than when I've only been buying one.
Just think of the simple supply-demand curve. As demand increases, price increases as well. The bulk discount pricing is only valid for amounts that provide better utilization of the supply chain. If Intel can produce 1M chips a month, then if somebody orders the last 50k, he might get a discount. If someone wants 2M, then he needs to pay a huge markup because the supply chain is not ready.
And Amazon is definitely big enough to move the equilibrium price up.
Because that's the only way this theory applies, if they're completely unable to meet the demand due to some specific shortage in the market.
The only reason your shares analogy makes sense is because there ARE a finite number of shares available at any given time, and buying too many drives up the cost in the entire market. Most manufacturers can scale up production as demand increases.
Let's say you have a factory that runs at 90% utilization and somebody crawls out of the woodwork who wants to order 3 factory-months worth of widgets, delivered next week.
Well first of all, you cannot meet that schedule, so you turn away the order in the instant case.
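A quick sketch of why that schedule is out of reach, assuming the factory keeps serving its existing customers and can only put its spare capacity toward the new order:

```python
def months_to_fill(order_factory_months: float, utilization: float) -> float:
    """Months needed to fill an order using only spare capacity.

    `order_factory_months` is the order size in months of full factory output;
    `utilization` is the share of capacity already committed (0.0 to 1.0).
    """
    spare = 1.0 - utilization
    if spare <= 0:
        raise ValueError("no spare capacity to work with")
    return order_factory_months / spare


# 3 factory-months of widgets against a factory running at 90% utilization.
print(f"{months_to_fill(3, 0.90):.0f} months")
```

With 10% spare capacity, a 3 factory-month order takes about 30 months, not a week, unless production is scaled up or existing orders are displaced.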
Now the question is: if we were to scale up production, what is the chance that some new person will crawl out of the woodwork with a similar instant order once the factories are ready? Because if we judge what has happened to be a one-off case, then we will refuse to meet the demand, whether it is real or not.
(Of course we're also making a lot of simplifying assumptions here like that you have access to capital, that there is no regulatory issue with scaling up production, that increased production does not open you to new lines of attack from your competitors, etc. Which are not good assumptions in general.)
It is our judgment of the demand, rather than the real demand, that controls production. If we are manufacturing, say, kevlar vests in 2001, we may very well interpret a large order as representing an underlying demand shift. On the other hand, if our widgets are luxury cars in 2008, we may interpret a similar set of facts as a one-off order.
The insight here is that real demand is not known at the time that supply is trying to meet it; it is estimated. The extent to which the market clears depends on how good the estimation is. With something like oil we understand demand fairly well, but in markets like consumer electronics the demand predictions are poor. That is why on the one hand Apple is chronically short of iPhones and simultaneously Amazon cannot give its phones away: all the estimates were off.
In short, the more your widget is impacted by technological or cultural shocks, the more likely it is that suppliers won't adjust to meet demand.
Really, microeconomics 101, all of us should have studied this in the first semester of any engineering degree :)
That seems like quite a snide remark.
I have in fact studied "economics 101," and while you're using basic economic theory to form your opinions, you're mixing that in with data which you've just created for the sake of supporting your original point.
Essentially you have no supportable reason to assume that supply cannot meet demand OR that Amazon cannot space out their demand/pre-warn the supply chain. Amazon could, for all we know, give them a 12 months lead time.
To be honest this entire conversation reminds me of that scene in Good Will Hunting when the guy in the bar is mouthing off about "market economy in the early colonies" because he just finished studying them last semester. Reading your posts comes across like you're trying to shoehorn in as much econ 101 knowledge as you can. And rather than provide data or any meaningful explanation for why you believe the market would go a certain way, you just shove in more econ 101 theory and hope for the best.
This post in particular lacks any substance, and is just trying to impress upon us how much econ 101 you know. But really I am more interested in why you believe the market wouldn't meet demand, rather than how many buzz words and theory names you can reproduce from your textbook.
They are just one of many large companies who buy hardware constantly.
Google, Microsoft, Facebook, Rackspace, Leaseweb, to name a few others...
If you need 10000 identical servers (i.e. exactly the same firmware versions, motherboards, hard drive version etc) then that is a bit of a pain since they can't just grab the next 10000 servers out of inventory and ship them to you. You have to make it as a separate special order.
Two problems here:
First off, 10,000 servers almost certainly cost less per unit than 100. Not least because you can buy direct from the OEM rather than through a reseller (who profits), and also because the buyer has more leverage in negotiations (that's a lot of money, and they COULD go elsewhere).
Second problem: The servers don't need to be identical, and in fact Amazon's EC2 instances aren't identical (they just pretend to be). If you spin up several EC2 instances over a few weeks then look at e.g. the CPU info, you'll see that they vary quite a lot but are similar-ish (this has caused people issues when they're using on-demand instances and their software relies on specific CPU features, in particular when those features only exist on current-gen CPUs).
PS - Also 10,000 is not even ballpark how many physical servers Amazon has (try 450,000).
> When it comes to labor cost - if you have enough hardware for at least one full time datacenter tech, you're in the same boat as Amazon.
I highly doubt that. Amazon's scale allows them to develop better automation, detection, and procedures in general, which allows the number of staff per server to be very low. For example, a single dedicated tech might be able to handle 10-30 servers MAYBE, whereas at Amazon that might be just a single rack, and effectively each tech might be responsible for hundreds of physical machines (even if automation does the lion's share of the heavy lifting).
I will fully admit that a company like SoftLayer (per the article) can give Amazon's EC2 a run for its money. However as someone who's seen the costs associated with running servers in house (in particular staffing costs) I struggle to buy that you can under-cut Amazon by doing so (at least until you have a LOT of servers, and even then frankly it is less hassle to out-source it anyway).
There are legitimate arguments for why you'd want to do so e.g. privacy, security, legal reasons, unique hardware/OS, etc. However if you're just doing something generic like web-host+database, then out-sourcing it to a dedicated company is more cost effective. In particular when you start looking at the hidden costs of internal hosting (like office space, heating/electricity, security, and so on).
Largely, the way to use Amazon efficiently is to turn off nodes when they are not needed for traffic. That is the service you are paying for.
With Ansible, I spend no more than an hour a week, amortized, maintaining both the hardware and administration. I assume nodes for any specific role will fail, I only scale horizontally, I always have redundancy for every role, I stay off disk as long as possible (heya 512GB RAM redis cluster), etc.
What I miss in this article is any details on why they had issues with AWS. You can't just say it wasn't reliable and not explain the details. AWS works for all of the world's largest startups, why didn't it work for Swiftype?
I run an environment that scales to around 1,000 EC2 instances daily. Primarily we run c3.2xlarge and r3.2xlarge for the core of our application.
We have ~12 nodes in our mongo cluster, and haven't had a single issue with these nodes.
I occasionally get a zombie (totally hung VM) but that's very infrequent. I was aggressively using spot instances previously, but have switched to all 12-month reservations (We would lose many machines to a spot outage, new machines - more than those on Richess) and the recovery time for our system is 35 minutes (due to the R3 boxes needing to download their in-memory index from other machines) - so our service is degraded in capacity until the relaunch of these machines completes.
[aside: if you're looking to use spot, do two things - over-provision by a factor of 1.8 and spread across zones, and go look into using ClusterK.com for their balancer product]
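The over-provisioning advice in that aside can be sketched as a small sizing helper. This is only an illustration of the arithmetic: the 1.8 factor comes from the comment above, while the zone names and the even split are assumptions, and nothing here talks to any AWS API.

```python
import math


def spot_plan(needed: int, zones: list[str], factor: float = 1.8) -> dict[str, int]:
    """Spread an over-provisioned spot fleet as evenly as possible across zones."""
    total = math.ceil(needed * factor)
    base, extra = divmod(total, len(zones))
    # The first `extra` zones get one more instance so the counts sum to `total`.
    return {zone: base + (1 if i < extra else 0) for i, zone in enumerate(zones)}


def capacity_after_losing_largest_zone(plan: dict[str, int]) -> int:
    """Instances left if the zone holding the most instances disappears."""
    return sum(plan.values()) - max(plan.values())


# Zone names are hypothetical placeholders.
plan = spot_plan(needed=100, zones=["zone-a", "zone-b", "zone-c"])
print(plan, capacity_after_losing_largest_zone(plan))
```

For 100 needed instances across three zones this gives 60 per zone, so losing one whole zone to a spot outage still leaves 120 running.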
Anyway, Just curious what was causing "sometimes daily" outages - I can't imagine that this would be due to AWS and not lacking ability of your application to handle instance losses.
I ask because, other than the VM security updates, none of our instances have these sorts of issues and some of them have a VERY long life (not ideal, we know). I understand the cost savings and the rest of the reasoning, but in my experience EC2 isn't THAT unreliable.
And the only solution provided by EC2 support was always to buy more instances to keep them cold and happy. The problems with that approach (just to name a few): the cost (for a young startup, burning money on idle infrastructure like that is not very wise IMO) and the fact that the time to design, develop and deploy a scale-out approach for each of your backend services is time you could have spent trying to build your product (again, startup-specific; you'll have to think about across-the-board 100% scalability at some point).
In my limited experience, I've rarely faced any significant issues with EC2. I assume your threshold for issues must have been much lower than mine.
And, as I mentioned in the article, we could always order new boxes for any of our clusters and get them online within a couple hours, so we are able to scale up pretty quickly if needed.
It is like many other things involved in running a technology company. Investing in automation can pay off hugely.
The newer instance types are very reliable too (in my experience).
For load balancing we have moved to a Route53 (health checks and round-robin) + a group of nginx+haproxy+lua-based frontend boxes.
Everything else was either built in-house or used open-source components and wasn't really tied to EC2 infrastructure.
Re: Rackspace and other providers – based on my real-life practical experience with a few of the largest providers in the States, SL's quality of service and provisioning speed are miles ahead of what competitors could offer. So it was a no-brainer to go with SL and I'm happy we did.
eg. Netflix probably spins up thousands of servers for a few hours.
Unfortunately cloud computing, or at the very least AWS, overpromises and underdelivers at scale. All in all economically viable use cases for cloud computing are very few and very specific at scale.
The only real solution is to move the bandwidth usage off AWS.
When you get to a point where you feel like this whole thing is going to fly, I'd recommend starting to think about whether paying the "cloud tax" (resources spent around EC2 stability issues and the cloud-specific stuff) is a good idea in your particular case. There are some companies that benefit greatly from the elasticity of the cloud (the ability to scale up and down along with their specific load demands), but many companies aren't like that. If your traffic is relatively stable and predictable (you do not have 10-100x traffic surges) and your infrastructure load does not grow linearly with the traffic, using real hardware over-provisioned to handle 2-5-10x traffic spikes without a huge decrease in performance may be a better idea in terms of cost.
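One way to sanity-check that trade-off is a rough monthly cost comparison between always-on over-provisioned hardware and an elastic fleet that only grows during spikes. Everything numeric below is a made-up placeholder (prices, spike size, share of time at peak), and the model assumes one cloud instance does the same work as one rented server, which other comments here dispute.

```python
def monthly_cost_dedicated(base_servers: int, spike_factor: float,
                           price_per_server: float) -> float:
    """Rented hardware sized for the peak and paid for all month."""
    return base_servers * spike_factor * price_per_server


def monthly_cost_elastic(base_servers: int, spike_factor: float,
                         peak_fraction: float, price_per_instance: float) -> float:
    """Cloud fleet that runs at base size, scaling up only during peaks.

    `peak_fraction` is the share of the month spent at peak load (0.0 to 1.0).
    """
    avg_instances = base_servers * (1 + peak_fraction * (spike_factor - 1))
    return avg_instances * price_per_instance


# Hypothetical inputs: 10 base servers, at peak 5% of the time,
# 100/month per rented server vs 250/month per equivalent cloud instance.
for spike in (2, 3, 10):
    dedicated = monthly_cost_dedicated(10, spike, 100)
    elastic = monthly_cost_elastic(10, spike, 0.05, 250)
    print(f"{spike}x spikes: dedicated={dedicated:.0f} elastic={elastic:.0f}")
```

With these placeholder prices, over-provisioned hardware wins for 2x spikes and loses once the spikes get large, which matches the point above: the answer depends on how spiky your traffic really is.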
Of course, you could start the company based on all of the PaaS magic sauce (databases, queues, caches, etc) provided by Amazon nowadays and only use EC2 to run your application code (AFAIU that's the ideal use-case for AWS) and just kill misbehaving nodes when an issue occurs, but then you need to factor AWS costs into your business plan because migrating away from a PaaS is almost impossible at any large scale, so you are going to stay with Amazon for a very long time.
There are three types of workloads that make sense to run on EC2:
a) Extremely spiky/seasonal loads (batch jobs, event/campaign traffic)
b) Loads that can be structured as to run entirely from spot-instances (worker-pools)
c) Loads so small that the markup versus rented/dedicated hardware just doesn't matter
Maybe I'd just add one more case here: some users are OK with locking themselves in to AWS by treating it as a platform from day one and building on top of AWS database/queue/etc services. For those people, using EC2 just to run the app code and replacing instances when they misbehave may be a good idea.
And to your point somewhere else in here, it is a hell of a thing to try and move away from that platform. Yeah, it's super easy to beat EC2 on cost of compute resources, but really if you're running everything yourself on top of compute at AWS then you're doing it wrong.
A lot of AWS services can be used by real hardware though, so it's not all or nothing.
For example, where I work we use S3 to store an archive of files but keep the working set of data cached on our web servers which are at codero.
We have video rendering servers which turned out to be much cheaper to do with a cluster of desktop-class hardware in a server closet at our office as opposed to the server grade GPU instances on EC2. The monthly cost of a single GPU instance at EC2 is more than the total cost of the hardware off of newegg.
However, for outages we have a script that spins up GPU instances on EC2 which is much more economical than having a separate set of servers somewhere just in case.
AWS is a success because there are no upfront costs, it lets you scale up very quickly, and you don't need in-house hardware expertise to maintain your machines. People are willing to pay a premium for these advantages.
1) you can't get new hardware delivered and get it up and running, all in under 10 minutes.
2) also, assuming the previously gathered hardware is not needed anymore, you can't just return it and say "i used it only for two days because i had a traffic spike, take this 200$ and we're okay".
3) you can't programmatically install, configure, reinstall and reconfigure hardware configurations, networking and services on physical servers. At least, not as easily.
Many others, but these are very valid points.
Of course, Amazon is not the solution to all of the problems you could ever have, but still it solves a great deal of problems.
2) SL has hourly physical server rental now (turn up is quoted at 20-30 minutes though)
3) SL has an API for ordering changes, and you can set up a script to run on first boot (and probably system images too). What are you thinking for network configuration? Really the only thing I've had to configure on SL is port speed (somewhat API accessible, but not if they need to drop in a 10G card/put you on a 10G rack, etc), and disabling the private ports (API accessible, real time changes).
With AWS I can scale up easy, not have to worry about doing things like replacing failed hard disks, and most importantly I can be in multiple geographic sites for no additional cost. That to me right now is worth a 50% premium as the cost for doing that would be higher than that savings.
I think if you reach a certain scale, and have predictable usage, it is not a bad thing to set up cabinets in 2 or 3 locations. We have found too that a lot of Colos are getting bought up and then will not lease you a few cabinets. They want to sell only to people who want a cage, or an entire room. It is hard for small to medium sized businesses.
Colocation is a very different beast and I certainly would not encourage anybody to do that until a very large scale when rented hardware economics stop working for them.
Sorry, but that is mostly a lie.
Running a non-trivial app on EC2 is significantly more complex than doing the same on (rented) bare metal. Scaling to a massive size can be easier on EC2, but only after you paid a significant upfront cost in terms of dollars and development complexity.
Is your app prepared to deal with spontaneous instance hangs, (drastic) temporary instance slowdowns, sudden instance or network failures?
Did you know that ELBs can only scale up by a certain, sparsely documented amount per hour?
Or that you need a process to deal with "Zombie" instances that got stuck while being added/removed to ELBs (e.g. the health-check never succeeds).
Or that average uptime (between forced reboots) for EC2 instances is measured in months, for physical servers in years?
Or that Autoscaling Groups with Spot instances can run out of instances even if your bid amount is higher than the current price in all but one of the availability zones that it spans?
The list of counter-intuitive gotchas grows very long very quickly once you move an EC2 app to production.
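As one example of the extra process this implies, here is a minimal sketch of a watchdog for the "zombie" case. It does not call any AWS API: the `Instance` record, the state labels and the 10-minute threshold are all hypothetical, and in practice the records would be filled in from whatever your monitoring or cloud API reports.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    instance_id: str
    state: str               # hypothetical labels: "pending", "healthy", "unhealthy"
    seconds_in_state: float  # how long the instance has been in that state


def find_zombies(instances: list[Instance],
                 max_unhealthy_seconds: float = 600.0) -> list[str]:
    """Return ids of instances stuck in a non-healthy state for too long."""
    return [
        inst.instance_id
        for inst in instances
        if inst.state != "healthy" and inst.seconds_in_state > max_unhealthy_seconds
    ]


fleet = [
    Instance("i-a", "healthy", 86_400),
    Instance("i-b", "pending", 1_800),   # health check never succeeded
    Instance("i-c", "pending", 45),      # still warming up, leave it alone
]
print(find_zombies(fleet))
```

Anything this returns would then be terminated and replaced by your own tooling, which is exactly the kind of operational code you do not need on hardware that stays up.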
This is a surprise to me, given that I work at an AWS shop doing things other people would call "DevOps". AWS doesn't automate provisioning or provide a (worthwhile) deployment pipeline, and AWS doesn't react (except in crude and fairly stupid ways) when something goes wrong or out-of-band.
No, they don't. They provide the tools, it's still up to you to orchestrate it.
But wouldn't that apply also to SoftLayer?
The main benefits of Amazon are that it:
a) allows you to scale down i.e. buy services in smaller portions than complete physical servers
c) integrated features
You could probably pay for one devops position once your infrastructure gets to 10 physical servers.
The best resource for researching different providers is the webhostingtalk.com forums. You can also contact me and I will do my best to advise you based on your desired criteria.
*Full Disclosure: I'm the founder/owner of a dedicated hosting company
I could go on and on about those, but other options were even more painful.
I've been running three 32GB servers (each with 3TB storage) with them for 2+ years now and the only outage I've experienced is the switch (5 port GBit) dying once. Hetzner tech replaced it in under an hour.
These three servers cost me €263/month (that's total, not each). Included in that monthly price is an additional IPv4 for each server, a private 5-port Gbit switch, remote console access and 300GB of DC backup space.
There are probably better deals available now (i.e. more RAM at the same price) than the one I'm on since it's old and not offered on their site any more (/makes note to self to call Hetzner sales)
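For anyone who wants those numbers broken down, here is the arithmetic as a quick sketch. The only inputs are the figures quoted above (three servers, 32GB RAM and 3TB storage each, €263/month total); since that price also covers the extra IPs, the switch and the backup space, the per-GB and per-TB figures are upper bounds rather than real unit prices.

```python
# Figures quoted in the comment above.
monthly_total_eur = 263
servers = 3
ram_gb_per_server = 32
storage_tb_per_server = 3

per_server = monthly_total_eur / servers
per_gb_ram = monthly_total_eur / (servers * ram_gb_per_server)
per_tb_storage = monthly_total_eur / (servers * storage_tb_per_server)
yearly_total = monthly_total_eur * 12

print(f"per server:     {per_server:.2f} EUR/month")
print(f"per GB of RAM:  {per_gb_ram:.2f} EUR/month")
print(f"per TB of disk: {per_tb_storage:.2f} EUR/month")
print(f"per year:       {yearly_total} EUR")
```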
I wouldn't want to do it, which is why I'd rather work for somebody who'd pay for AWS, but I think there's a thing in there somewhere for those who want to dig.
1. Make your own hardware
2. Own/manage your own hardware
3. Rent commodity hardware from a standard hosting provider
4. Use IaaS (e.g. EC2)
5. Use PaaS (e.g. AppFog, Nodejitsu)
6. Use BaaS (e.g. Firebase, PubNub, Pusher.com)
The higher the level, the more technical flexibility you lose.
The bigger the company, the more it makes sense to operate at a lower level because there is no significant wastage being introduced as you move down the levels (you remove the middlemen so you can pocket their profits) and the capital cost to move between levels is relatively low.
Compare software to another industry like cheesemaking, for example: if you're a cheesemaker and you want to make your own milk, the next step is to buy the whole farm, and then you have to figure out what to do with the meat (wastage). Going between these two levels is expensive and could mean doubling or tripling your expenses, so it's not an easy move to make.
Of course, some of this has to do with your team's skill level, but I've had clients run up $100k+ monthly bills at AWS with a relatively small build-out. (and, wow, VPC migrations..)
For fixed or predictable growth patterns on a mature app/platform, a slow build-out on real iron will generally be significantly less expensive, all other things being equal.
However, there are other advantages to AWS that get lost in this story, such as pre-built, highly scalable datastores. Comparing EC2 to real iron misses most of the real story on why the cloud is changing everything.
One of the hardest things I have to tell clients is not to build their own datastore/database in house or at EC2; sometimes the case is clearcut, and sometimes not so much, but if you have a datastore at AWS that gives you 80% of what you need, use it instead of rolling your own. (source: IAMA AWS Growth Architect)
Health checks and failover are must-haves now, but this article makes me wonder three things:
1) Are there any DNS services that understand geography of your "zones", i.e. route to and failover based on IP? (but are still platform agnostic).
2) How long can a DNS failover take worst case? You can technically set a low TTL, but don't a lot of ISPs just increase that to a minimum?
3) Isn't it better to replace some of the DNS failover with high availability dedicated load balancing?
2) Yes, some ISPs do set a minimum TTL. Although BGP anycast is the most effective as the first line, sometimes it makes sense to have your reverse proxy caching layer override that distribution based on GeoIP and redirect to a more suitable proxy node closer to the client. This is especially the case when people use recursive lookup DNS servers that aren't necessarily geographically close to them (e.g. 220.127.116.11). It could also be useful in cases where TTL expiration hasn't caught up yet.
3) No. Think of BGP Anycast DNS as distribution at a global level, and dedicated load balancers as distribution at the local level. You need to work out how to get the traffic to the load balancer first, and load balancing across distant geographies (high latency) results in horrible performance.
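As a rough illustration of the GeoIP override described in 2), here is a minimal sketch that picks the proxy node closest to a client by great-circle distance. It assumes the client's coordinates have already been resolved by whatever GeoIP database you use (no lookup library is called here), and the node names and coordinates are made up for the example.

```python
import math

# Hypothetical proxy nodes: name -> (latitude, longitude) in degrees.
PROXY_NODES = {
    "fra": (50.11, 8.68),
    "nyc": (40.71, -74.01),
    "sin": (1.35, 103.82),
}


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def nearest_node(client_location, nodes=PROXY_NODES):
    """Return the name of the node geographically closest to the client."""
    return min(nodes, key=lambda name: haversine_km(client_location, nodes[name]))


# Client coordinates as a GeoIP lookup might return them (assumed values).
paris_client = (48.86, 2.35)
tokyo_client = (35.68, 139.69)
print(nearest_node(paris_client), nearest_node(tokyo_client))
```

Geographic distance is only a proxy for network latency, so in practice you would probably want to weight this with measured round-trip times.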
DNS TTL is not as big of an issue today as it was 5-10 years ago, when idiotic ISPs were trying to save on DNS resolving by ignoring TTLs. Nowadays you see an almost perfect drop in traffic when switching off a load balancer. Only bots and some weird exotic ISPs may keep sending traffic to a disabled box for up to an hour or two, but since DNS LB is only used to handle real emergency outages, and for planned maintenance we can move LB IPs around, I really do not see it as a big enough issue to stop using the DNS LB magic :-)
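On the "how long can a DNS failover take worst case" question, a back-of-the-envelope model may help. This is a sketch under stated assumptions, not how any particular DNS provider works: it assumes the provider needs a fixed number of consecutive failed health checks before pulling a record, that the record change is published instantly, and that a resolver may clamp your TTL up to its own minimum. All parameter names and example values are hypothetical.

```python
def worst_case_failover_seconds(check_interval, failures_needed, ttl,
                                resolver_min_ttl=0):
    """Upper bound on how long clients may keep hitting a dead endpoint.

    Detection takes check_interval * failures_needed seconds, after which a
    resolver that cached the record just before the change keeps serving it
    for the effective TTL (your TTL, or the resolver's minimum if larger).
    """
    detection = check_interval * failures_needed
    effective_ttl = max(ttl, resolver_min_ttl)
    return detection + effective_ttl


# Well-behaved resolver: 10s checks, 3 failures, 5s TTL.
well_behaved = worst_case_failover_seconds(10, 3, 5)
# Resolver that enforces a 300s minimum TTL regardless of what you set.
clamped = worst_case_failover_seconds(10, 3, 5, resolver_min_ttl=300)
print(well_behaved, clamped)
```

The point of the model is that once a resolver clamps the TTL, lowering your own TTL further stops helping and the health check settings dominate what you can still control.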
Twitter, Mozilla and lots of other big names use them. I remember watching a webcast where Mozilla said they used Dyn's anycast failover service, with TTLs on their domains set to 5 seconds.
I've been using their DynECT entry level package ($30/month) for a couple of years and it's great.
Edit: you might also find this comment from an old thread interesting/useful: https://news.ycombinator.com/item?id=7813589 (go up two levels to phil21's first comment - HN isn't giving me a direct link sadly)
- We're starting from scratch and think AWS will give us flexibility for cheap
- We have existing servers and think moving from them to AWS will give us flexibility for cheap
Cases where people move away from EC2:
- It was slow/unreliable
- It was expensive
Conclusion: You should use AWS EC2 in order to save money and have flexible resource allocation, but don't expect it to be stable or cost-effective.
I used to be able to set a watch by HP's cycle before the EDS buyout.
It doesn't always get publicity. I work for a major company, and we don't scream from the roof tops about it.
I'd be very interested what problems you were having with them and at what scale. If this is a private topic, we could do it over email or some other medium if you like. You can contact me by any of the means listed here: http://kovyrin.net/contact/
Aside from that it was mainly nit-picky type stuff, but still things that were annoying (networking issues between DCs, networking issues between pods, internal mirrored apt-get repos going out of sync, API is kind of blah, etc).
We use Docker, so having a few bare metal machines with tons of containers on them wasn't a great HA setup (for us at least), even running in two data centers. The fairly quick setup time, though, was a nice selling point.
When we went to AWS things just kind of worked. The API was easier to use and the GUI portal was way nicer/stable. So far we have not had any odd issues with our instances, but we also typically run them at about 50% capacity so that might be why. It is also still early so maybe things will come up in 6+ months that send us back to SL :)
Anyone offer competition to VPC?