How and Why Swiftype Moved from EC2 to Real Hardware (highscalability.com)
252 points by quicksilver03 on March 16, 2015 | 189 comments

There's not much need for a fancy article on a fancy website in order to understand a key concept of cloud computing:

Cloud computing offers you the great advantage of being able to instantly scale your application, replicate your data, and basically just grow with your business volume, all without significant investment, delivery time, setup time, people time, or maintenance. But it's expensive in the long run.

And this is OKAY, this is GREAT.

Once you're big enough that you know what your load is now and what it will likely be, and you know exactly what you need now and (approximately) what you're going to need in the near future, setting up your own datacenter is way, way more cost-effective.

Amazon does not get free electricity, free servers, or free people time. Of course, you're paying for all that, and you're also paying Amazon's profits.

This is absolutely fine, as long as their service fits you.

But when you grow enough, put simply, your needs change. It's just that.

The reason why it is extremely hard to engineer robust large scale AWS cloud apps can be summarized under the umbrella of performance variance:

  - machine latency varies more, you can't control it

  - network latency varies more

  - storage latency varies more (S3, Redshift, etc.)

  - machine outages are more frequent

where "more" can mean an order of magnitude more variation than on bare-metal deployments. I am not saying the performance is that much worse, only that it will vary unpredictably for any given instance. The interference is non-Gaussian and can happen in bursts, as opposed to easy-to-model-and-anticipate white noise.

It's a lot harder to engineer cloud-scale software to scale robustly and not degrade in latency when running on a large number of nodes. For example, see [1].

Most open-source cloud software does not come with these algorithms included, batteries and all, and it is not trivial to retrofit this kind of logic. Just being smart about load balancing won't cut it when at any given moment one of your nodes can become 10x slower than the others, even though your code is sound and does not slow down like that on its own.

In fact, what you lose in AWS convenience and "free" maintenance, you gain in simpler RPC/messaging/fault tolerance/storage infrastructure that can sometimes accommodate an order of magnitude more traffic or users on a machine than if deployed in AWS.

[1] http://research.google.com/people/jeff/latency.html
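The standard mitigation discussed in [1] is hedged (backup) requests: if the primary replica doesn't answer quickly, fire the same request at a second replica and take whichever answers first. A minimal sketch in Python, where `request_fn` and the replica list are hypothetical stand-ins for your RPC layer:

```python
import concurrent.futures
import time

def hedged_request(replicas, request_fn, hedge_after=0.05):
    # Send the request to the first replica; if it hasn't answered
    # within `hedge_after` seconds, fire a backup request at a second
    # replica and return whichever response arrives first.
    with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
        futures = [pool.submit(request_fn, replicas[0])]
        done, _ = concurrent.futures.wait(futures, timeout=hedge_after)
        if not done:  # primary is slow -- hedge to a second replica
            futures.append(pool.submit(request_fn, replicas[1]))
            done, _ = concurrent.futures.wait(
                futures, return_when=concurrent.futures.FIRST_COMPLETED)
        return next(iter(done)).result()
```

The hedging threshold is usually set around the 95th-percentile latency, so the extra load from backup requests stays small.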

Great comment! Thanks a lot for sharing this.

> There's not much need for a fancy article on a fancy website in order to understand a key concept of cloud computing:

I wish it were true, but plenty of companies are gripped by cloud fever. I've seen quite a few going down the route of charging into the cloud not because they've run the numbers and found it stacks up, but because they want to be in the cloud, and Amazon have some great marketing people.

It is interesting coming at it from the other side. I get hit up all the time to move my infrastructure to the cloud; I give them the specs and what I pay per month and tell them that if they can beat it, I'll switch. The closest anyone has come has been about 4x the cost I'm currently paying.

But I note that I have a unique case that isn't covered by the cloud "architecture" (crawling the web and indexing it). I explain that it is "ok" if I am in the 10% not covered by their 90% solution, but sales folks never like to hear that.

Yeah, it's always fun when I tell someone that we're 100% self-hosted too. They never stop to think that traffic, load, and usage patterns on craigslist are pretty well understood at this point. Plus it's certainly cheaper than paying someone else to run all your hardware. You get to pick your peers, pick your hardware, and allocate as you see fit.

It's cheaper if you have a good team, which it sounds like you do.

A ton of places don't have a good team, and the cost numbers end up being a lot closer.

A great team will destroy cloud margins. An average-to-bad team will fall roughly in line with them, plus you don't have to run the team anymore.

Your comment makes sense, but it implies Netflix does not have a great team, or that it has considerations other than purely financial ones?

I've had many clients on AWS rack up six-figure monthly bills unnecessarily, simply because they didn't know how to design for cost. It's not a silver bullet; those penny fractions add up fast. There are a lot of tricks to save money.

We run into this at FastMail all the time as well. For the one customer who needed "their own gear" we built out on SoftLayer for much the same reasons as the article - real hardware where you need it.

But our operations costs are so much lower than renting that we could replace all our hardware every year and still break even.

(yeah, we could get SoftLayer to refresh our hardware every year as well, but we don't need it refreshed that fast, and at the end we still own the hardware)

Oftentimes companies move because their organization doesn't deliver, and the hope is that a cloud company will do a better job of it.

If you have a great team, I firmly believe hosting yourself is far, far, far less expensive.

If you have a terrible team, then cloud (hosting) is less expensive. Even if it were exactly the same cost, you're gaining by not having to have a staff to run it, the costs of managing them, etc.

Most places don't have great teams. Insert random corporation here likely has a team that is a mess for whatever reasons happen in large companies.

In that case, the Cloud makes a ton of sense for them. They've already screwed up their own organization in some way, and this is a large reset button on the whole thing.

That's worth a ton in itself.

If you're a small org with relatively small volume self-hosting doesn't make sense.

Can you elaborate on why not?

You can get some cheap dedicated hosting (ex: [1]) for a fraction of the price.

It's so cheap compared to AWS that you can order a few spares and still come out cheaper than your one beefy AWS instance.

The only way it doesn't make sense is if you need to scale up and down very fast.

[1] 60 euro/month: Quad-Core Haswell, 32 GB (non-ECC) RAM, 240 GB SSD @ https://www.hetzner.de

The cost of the systems is almost irrelevant; the time of qualified people is far more expensive. So it's not just the elasticity — you have to look at everything from the time perspective as well. What additional knowledge will I need? Will I need to learn load balancers, and how to make them highly available? Will I have to learn about SSL certificates and termination? Will I need to learn how to implement and operate a secure and highly available DNS service? How much is your time worth? Our hosting costs are a fraction of a single worker's salary, so whether AWS is more expensive is essentially irrelevant. The OP argues that it was not doing the job for them, which is an entirely different issue.

Don't forget the quality of the work. Frankly, if I'm administering servers I'm not going to do as good a job of it as someone whose whole job is that and I'm not going to do as good a job as Azure or AWS or whoever else either. Platform as a service is the way to go.

Hetzner is fantastic but if you're not in Europe, what's your latency like?

There's OVH that has a big data center in Montreal.

Because if your sysadmins are all moonlighting developers, having them do subpar system work instead of developing is more expensive than just paying the cloud premium.

Reset button is worth a ton in itself. This is a really nice way of looking at it.

From personal experience this is absolutely true. The initial cost to migrate and learn cloud practices is unbelievable, but once you have the process in place, running on Amazon CAN save some cost down the road, including the hours you would need to replace hardware. You will build tools, or use existing ones, to create your infrastructure and operations process. A generous estimate for reaching that level of maturity is 1.5 years. For an ever-growing business this cloud fever is acceptable.

You can definitely save cost by subscribing to reserved instances, but the downside is you have to put down money upfront, which is very hard for many small players out there.

But watch out if you run data pipeline jobs - sometimes your so-called big data is really not that big. A few-GB daily report doesn't need to run on c3.xlarge instances; it can do just fine on a 24/7 m3.large instance. There was an article on HN a while ago about how one could run a custom report with shell commands on commodity hardware and get 100x the performance compared to running on EMR. You can also consider running most of your jobs on premises; the network bandwidth in/out is probably going to be cheaper than running all of your jobs on EMR. Direct Connect is a great choice to boost connectivity stability and security. Go for it.
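For data at that scale, the whole "pipeline" can be a one-liner. A sketch assuming a hypothetical `date,country,amount` CSV layout (file name and columns are made up for illustration):

```shell
# Daily revenue per country from a few-GB gzipped CSV
# (hypothetical layout: date,country,amount) -- no cluster needed.
zcat report.csv.gz \
  | awk -F',' '{ sum[$2] += $3 } END { for (c in sum) print c, sum[c] }' \
  | sort -k2 -rn
```

On a single box with the data on local disk, a pipeline like this is often bound only by decompression speed, which is the point the HN article was making.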

Cloud is great for HA, because on Amazon you are encouraged to build multi-AZ and even multi-region. S3 is absolutely the de facto standard for object storage today, IMO. It's cheap and reliable. The learning curve for doing Amazon (or just about any cloud provider) properly is really steep. You can either end up melting down like a Black Friday sale, or running like Netflix, with Chaos Monkey calmly enjoying its tea.

Running in the cloud is no different from running on-premises, except you have to start all over again, because now you have to reconsider your network, security, monitoring, and practices.

"You can definitely save cost by subscribing to reserved instances, but the downside is you have to put down money upfront (...)"

This isn't required anymore. AWS introduced new Reserved Instance options a few months ago, including a "no upfront" option which still gets you a ~40% discount over on-demand prices.
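The break-even arithmetic is simple either way. A sketch with hypothetical hourly prices (check the current AWS price list for real numbers; these are assumptions for illustration):

```python
# Hypothetical hourly prices -- illustrative only.
on_demand_hourly = 0.140    # assumed on-demand rate, $/hr
reserved_hourly = 0.084     # assumed "no upfront" reserved rate, $/hr

hours_per_month = 730
on_demand_monthly = on_demand_hourly * hours_per_month   # ~$102/mo
reserved_monthly = reserved_hourly * hours_per_month     # ~$61/mo

savings = 1 - reserved_hourly / on_demand_hourly
print(f"monthly: ${on_demand_monthly:.2f} on-demand vs "
      f"${reserved_monthly:.2f} reserved ({savings:.0%} saved)")
```

The catch with "no upfront" reservations is the commitment, not the cash: you are billed for every hour of the term whether the instance runs or not.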


But if the alternative to reserved instances is buying servers outright, then paying upfront isn't so much of an issue.

My old company moved, after I left, from a dedicated system I built that provided less than 1s response times to a cloud system that averaged 8s response times, because the company focus changed from maximizing performance to maximizing sizzle buzzwords.

I'm glad I left.

Sounds akin to the outsourcing binge that went on a decade (?!) or so back.

Back then it seemed most companies did it not because they had run the numbers, but because a few big names had done it, and so the others did it to apparently piggyback on the stock market's attention.

In fact it is related to outsourcing. In the past, outsourcing processing involved shipping drives of data overseas. The risk and schedule hit were big, and the offshore companies had crap hardware that usually cost a lot more than it did in the US. That quieted down for a while. The new cloud stuff got these people interested in processing outsourcing again, only now the data is shipped to Google/Amazon/whoever and the outsourced employees use cloud VMs to run the processing. I wonder how long before this mini-trend starts slowing down, when they finally figure out that management overhead and poor communication are in most cases the main problems with outsourcing.

That's because you need to use the features of your cloud. Run lots of servers and shoot them when you don't need them. That's why the cloud is better than anything else. But most people just rent their one xl3.large instance.

If only it were that simple... But every second relatively young engineer I interview points out "But Netflix! Look how well the cloud works for them!" and once again I need to explain that Netflix spends millions in engineering resources to handle EC2 issues, and we are not (almost nobody is) Netflix (yet?).

Netflix also runs many of their own servers, and puts their own racks into many data centers.



Amazon also runs Netflix racks

Oh thank goodness. I thought I was the only one.

Not sure what you mean by "datacenter". If you mean using co-location or similar, it probably works out to a small price difference. If you start with an empty room or an empty piece of ground, it might cost you a lot to build everything and hire people to manage it all. Remember that the big cloud providers get a scale benefit where a lot of virtual machines share a small operations team. I do think that physical servers are usually faster than cloud machines, meaning you might be able to use fewer of them to compensate for the price difference.

I see there are already some contrarian-because-I-feel-like-it replies, but this is a really succinct way to put it. Of course there are exceptions and edge cases, but EC2 is all about elasticity. If that doesn't appeal to you, don't touch EC2. If you aren't using the "E" in EC2, there are far cheaper, more performant alternatives.

On a side note, I feel like EC2 is simultaneously the best and worst service on AWS.

>you're also paying Amazon's profits.

Are you? Amazon is not known for posting profits, and is quite possibly operating AWS at a loss.

They have promised to show financial results for AWS itself with the quarterly report being released next month, so we'll see.

Amazon is not known for posting operating profits, because they plow funds back into new research and businesses. But they most definitely operate with gross margins on everything they sell.

Update: in FY2014, they sold $89BN ($70.1BN in products, $18.9BN in services), and their cost of sales was $62.8BN. I am pretty sure that their margins are razor-thin in retail, but the service side had higher margins.

A Motley Fool article claims that while AWS is successful, it isn't profitable, mainly because of the accounting behind capital expenditure. Last year Amazon spent more in interest than their entire operating income. http://www.fool.com/investing/general/2015/02/04/amazon-just...

Profit for something like AWS is likely difficult to put a number on.

They built it so they could run Amazon.com off the infrastructure, but then decided they could make money leasing their under-subscribed portions.

If they are not making a cut-and-dry profit from AWS (I'd fathom they are, they are one of the most expensive cloud providers and many other providers turn a healthy profit), then they are at least dramatically offsetting their own Amazon.com infrastructure costs which are mandatory anyway.

Actually, they built AWS as a business on its own right from the start and only later moved Amazon.com onto AWS.


If AWS posts a loss it'll be because Amazon is reinvesting resources into its growth, not because they're subsidizing each user's costs.

The reason they're not making profits as a company is they're spending the money on new projects.

That doesn't mean the AWS part is itself unprofitable. I'd wager it's profitable, given that the business is somewhat mature and they have a huge chunk of that market.

Here's my anecdotal, one-data-point experience from moving a giant EC2 environment to datacenter:

1. Your operational overhead will increase _a lot_. Be ready to hire a lot of ops staff if you expect them to do anything but put out fires. And as you grow you'll need specialists, like network engineers.

2. Any weirdness you experienced with AWS infrastructure will be replaced with weirdness in your own environment, except now you're on the line to troubleshoot and fix it yourself.

3. Operation staff will immediately start guarding the food bowl as resources become finite. Server provision waits start to seem like breadlines. Power is consolidated with Those Whom You Must Ask.

4. Your cost will decrease, sometimes significantly so.

5. You'll have more hardware flexibility to run your app just the way you want to (Stack Overflow's mega databases come to mind[0]).

In the end I think this type of transition is for stable companies that don't mind or even prefer strong divisions of labor (coders who code, sysadmins who sysadmin, testers who test), but it's not for startups or companies that hope to move with any kind of strong velocity.

[0] http://highscalability.com/blog/2014/7/21/stackoverflow-upda...

Contrast that to my experience with renting from Softlayer:

1) The operational overhead is the same as with Amazon. Actually, I think it's less, because everything is so predictable: every machine we get has performance identical to any other. We still don't have to care about the same list of things EC2 took care of for us, and we also don't have to care about weird cloud issues.

2) See the above answer, but also SoftLayer is happy to provide actual support for any issues in a timely manner. In general, everything has been so much more reliable that you can actually think less about the high-availability technologies that make your stack more complex. In the years I've been using them, we have only had a few hard disk failures; the disks were replaced in around an hour, and we just failed over to a replicated slave manually for around 2-3 minutes of downtime.

3) Resources are no more finite than EC2. The only difference is that the provisioning time is 1 hour rather than 1 minute. That has still been perfectly fine to respond to unexpected load events.

4) Absolutely

5) Also great.

People often compare Cloud Hosting with Colo. What they should compare cloud hosting with is renting servers. You get 95% of the benefits of cloud with none of the drawbacks.

Can you walk me through a concrete example of cost savings? Every time I look at something like SoftLayer, it doesn't seem to offer much of a savings compared to a corresponding EC2 reserved instance. Sure, you can do way better than EC2 on-demand instances, but the reasonable comparison is to a reserved instance.

I think a big part of it is that dedicated metal and "reserved instances" are not apples-to-apples. You want absolute control, consistency, and customization of the build, you can buy/lease metal and co-locate it, or you can rent it from someone like SoftLayer, and you still don't ever have to walk into a data center.

Inter-DC network traffic is not metered/charged at SoftLayer. The key cost differentiator when contrasting SoftLayer with other services is network charges.

> Your operational overhead will increase _a lot_. Be ready to hire on a lot of ops staff if you expect them to do anything but put out fires. And as you grow you'll need experts, people like network engineers.

Why? What are you talking about? You are renting servers, not colo'ing them. Networking them is not your problem. Your responsibility still starts from a root prompt; there's just no VM layer between that and the physical server.

Yes, it really does look like @stephen-mw is talking about a migration to owned/colocated hardware. And that's a very different kind of hairy mess I'd prefer to stay away from as long as humanly possible.

OMG, colo. I've been the architect of a Top 500 website acquired by a Top 100, and even for that we just rented a few servers (you'd be surprised how few -- a dozen or so). I do not really know how big you need to be to make colo a sensible choice, but quite probably much, much bigger than most people think. We ran the numbers and they weren't pretty.

And for cloud: we purchased video conversion as a service, that one ran in the cloud. I can see how that makes sense.

And if you have more than one? Or more than one DC? Somebody needs to connect them, or you will need a VPN. It's not that easy without clouds if you need connected servers. We switched to AWS since connecting servers in a DC isn't as cheap as people think.

I host all my stuff with Hetzner (who are at the cheap end of the market compared to SoftLayer) and even they provide pre-configured VLANs and private switches. For inter-DC you can just set up some redundant OpenVPN links or pay the DC to configure a hardware tunnel between both sites.
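A minimal static-key point-to-point OpenVPN link of the kind described above might look like this (addresses and paths are hypothetical; you'd want keepalive settings and TLS-based auth before relying on it in production):

```
# /etc/openvpn/dc-link.conf on box A (mirror remote/ifconfig on box B)
dev tun
proto udp
remote 203.0.113.10 1194    # public IP of the peer datacenter's box
ifconfig 10.8.0.1 10.8.0.2  # local and remote tunnel endpoints
secret /etc/openvpn/static.key
```

Run two such links over different upstream paths and you have the "redundant" part as well.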

We get inter-DC connectivity for free with SoftLayer. There really is no difference for us between connecting to a server within a DC or between datacenters (aside from added latency, of course) – the same private address space, the same access rules (we see all of our machines in a private backend network from any datacenter).

This article was talking about renting servers from Softlayer. All servers on your account in Softlayer have a private VLAN that connects them, even between data centers. Private networks for your servers are a feature of many server hosts.

Every enterprise router in existence has built-in tunnel support. Even open-source ones like pfSense.

> Operation staff will immediately start guarding the food bowl as resources become finite. Server provision waits start to seem like breadlines. Power is consolidated with Those Whom You Must Ask.

This is the worst part about moving out of the cloud, especially since cloud computing has moved a lot of ops and deployment responsibility to developers.

I would say an absolute pre-req for moving from Amazon to own servers would be tooling like VCAC or OpenShift to give people a nice self-service experience.

Then you get "out of capacity" errors even sooner. You can have a pretty bad experience with an internal cloud that is self-service but isn't run like a service on the business side, so the fixed annual budget gets exhausted immediately.

My 5 cents:

If you have lots and lots of money and a high margin business, do yourself a favor and go with Amazon (much less hassle with contract management and low level challenges).

If you need to scale month to month and are growing 50% per month, go with Amazon.

If you are very small and can live with 10 instances, go with Amazon.

If CAPEX doesn't help you and for whatever reasons you need to spend OPEX, go with Amazon.

If you need many (types of) machines for failover that otherwise mostly sit idle, go with Amazon.

Otherwise it's always cheaper to buy or rent hardware. Amazon is very expensive (TCO).

If you base your decision on hype, you're screwed.

* "Amazon" here stands for any cloud provider; personally I'm choosing DigitalOcean with Mesos/Docker.

* Except S3 which is a no brainer to use.

> If you need to scale month to month and are growing 50% per month,

Then rent more servers.

> If you are very small and can live with 10 instances,

Then rent a few servers.

I maintain that there are extremely few cases where a typical website needs the cloud. To handle peaks, it is both simpler and cheaper to keep enough capacity idling around than to spin Amazon instances up and down. The cloud is almost always useless hype. It can be different if you can architect around the various services Amazon provides.

From my experience, renting more servers while growing 50% a month is a challenge. A lot of things go wrong when installing a lot of servers each month.

Also from my experience: with 10 instances, the money you save with custom servers is negligible, and contract management, SLA management, multi-datacenter setups, etc. are easier with a cloud provider than with rented servers. At least where I've rented servers in the past.

I'm not sure where you see a difference between a VM and physical hardware when it comes to provisioning.

Sure, the physical hardware takes 1 hour rather than 1 minute to spin up, but the process is otherwise entirely identical.

1 hour? Welcome to 2015! OVH spins your server up in two minutes.

Why rent more servers when it's so much easier to spin them up as you need them in AWS and tear them down (and no longer pay for them) when you don't? If you're expecting unknown changes in scale, it's so much easier to just spin up servers to keep up than anything else.

> To handle peaks, it is both simpler and cheaper to keep enough capacity just idling around than spinning up and down Amazon instances.

At a growing website, you have no idea how much "enough" is. Why try to estimate caps when you don't have to?

Absolutely, 100% agree. Thanks for the comment!

"With Amazon we experienced networking issues, hanging VM instances, unpredictable performance degradation (probably due to noisy neighbors sharing our hardware, but there was no way to know) and numerous other problems. "

Why do I get the feeling it was kind of a cop-out to just pack up and move without finding the root cause? I've seen it plenty of times: the "best solution" is to just find a different hosting provider.

In my experience, I've never found an issue with an application on AWS that wasn't caused by either a misunderstanding of what was being offered (e.g. not provisioning enough PIOPS for database volumes), or simply issues with the application code.

> In my experience, I've never found an issue with an application on AWS that wasn't caused by either a misunderstanding of what was being offered (e.g. not provisioning enough PIOPS for database volumes), or simply issues with the application code.

You haven't been using Amazon long enough then.

Amazon is great for proofs of concept. No upfront costs, extremely scalable, etc. Unfortunately, it's expensive compared to physical hardware once you get to scale, and you may never solve underlying performance issues due to it being a shared-tenant environment, even if you're a Netflix-sized customer.

You can use abstraction layers to isolate yourself from issues with the underlying metal. For example, I had a good thread the last time the maintenance reboots happened: https://news.ycombinator.com/item?id=9120289.

Solving multi-tenancy issues is hard, but not impossible. I think it's a lot easier with live migration. If a box is giving you problems, just move the load to a new box while maintaining the same IP addressing.

With respect to cost, yes, AWS gets expensive at scale, but if you're at scale your servers are generally not your major cost center (it's usually payroll and licensing).

That is what I like about the cloud. Running in your own hardware lets you be, well, "lazy" about application architecture. Running in a place where it is shared and instances can disappear forces you to design a lot more robustly and nimbly.

Of course, that design discipline is great wherever you are running....

Hardware is almost always cheaper than engineering time. Don't optimize prematurely.

Yes, but engineering is cheaper than running after real-time problems.

Now we get to the technical debt debate. Sometimes, you have to make good decisions now instead of perfect decisions later. The market doesn't care how elegant your code is.

No debate, I agree with everything you have said. But sometimes you have to clean that garbage up because it does matter to the market.

The market cares about uptime, and cost relative to return.

Perfect should never be the enemy of the good.

> Perfect should never be the enemy of the good.

Only in a world where resources are infinite does this work.

But doesn't Netflix successfully use AWS now?

It's possible they get special treatment if they are big enough (nobody else's jobs on their physical machines... or something like that).

Yep, that is why I specifically noted in the article that, given enough resources, it is possible to survive and even thrive in the cloud (Netflix being one of the best examples of that). But in the case of a startup it is not always the best idea to keep burning money and engineering resources when your primary job is to keep building a business.

And no, I don't think they're getting any real special treatment from Amazon, since every single talk from a Netflix engineer points out the cloud-specific issues they're solving in their infrastructure software.

Netflix uses thousands of instances, they don't share servers.

Using thousands of servers doesn't mean that you don't share physical servers.

Even if you own all the VMs on a physical server it's hard to avoid the noisy neighbour problem. Netflix has its own instance monitoring tool - Vector - to handle these issues: http://techblog.netflix.com/2015/02/a-microscope-on-microser...

I've seen and fought issues with hanging and/or stalling EC2 instances, and the decision was made to move to physical hardware - it was one of those tradeoff choices you have to make about paying N dollars to throw money at hardware vs. M dollars at person-hours to investigate.

It's definitely kicking the can down the road (eventually you have to build such that failing infrastructure is transparent to your eng team), but I still think it was the right decision at the time. YMMV obviously. :)

I've had the odd problem with hanging/stalling instances before I learned to design for the cloud. Now those instances automatically get replaced with new ones and nobody would even notice if I didn't pipe the notice through to Slack.

Of course, this only works if whatever you use it for allows for this.

However, in most cases, not using EC2 as a stateless, throwaway computing resource is simply a matter of bad infrastructure design.
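The "throwaway instance" pattern described above can be sketched as a reconcile loop. The `provision` and `notify` hooks here are hypothetical placeholders for a cloud API call and a Slack webhook:

```python
# Sketch of the "replace, don't debug" supervisor pattern: any node
# that fails its health check is dropped and a fresh one provisioned.
def reconcile(fleet, is_healthy, provision, notify):
    """Return a new fleet with every unhealthy node replaced."""
    healthy = [node for node in fleet if is_healthy(node)]
    for node in fleet:
        if node not in healthy:
            replacement = provision()           # e.g. cloud RunInstances call
            notify(f"replaced {node} with {replacement}")  # e.g. Slack webhook
            healthy.append(replacement)
    return healthy
```

Run on a timer (or triggered by load-balancer health checks), this is the loop that makes a hanging instance a non-event rather than an incident.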

This is a no-brainer if you've ever done anything at scale. The explanation is rather simple - hardware is always "on the premises", yours or Amazon's. Someone needs to swap drives, swap motherboards, man the networking gear, run cables, etc. Amazon doesn't really get a break on the hardware cost, because 10,000 servers do not cost less per server than 100 servers (in fact they cost more as the volume goes up, if you need them to be identical). When it comes to labor cost: if you have enough hardware for at least one full-time datacenter tech, you're in the same boat as Amazon.

So you're paying Amazon to do the same work you would do otherwise - only you're subject to their rules and procedures and Amazon being a profitable business needs to mark their services up.

> The explanation is rather simple - hardware is always "on the premises", yours or Amazon's. Someone needs to swap drives, motherboards, man the networking gear, run cables, etc.

> So you're paying Amazon to do the same work you would do otherwise - only you're subject to their rules and procedures and Amazon being a profitable business needs to mark their services up.

But I thought that they were paying Softlayer to do that stuff instead of Amazon. They're not doing it themselves - and yet it's still cheaper!

I would like to know the cost calculation after a year or two. With a handful of servers it's easy to get the false impression that HW failures are rare.

Oh, it was more than a handful of servers after we finished the migration (we migrated a bit late, IMO, so we had a lot of traffic even back then). And today, with a much larger infrastructure, with hardware clusters specifically tailored to our customers' needs, etc., I'm pretty sure the same infrastructure on EC2 would cost more than 2x as much.

(Update) Re: failures - with ~50 servers we see a hardware issue (a dead disk in a RAID or an ECC memory error) about once a month or so. None of those failures has caused a single outage (RAID and ECC RAM FTW) so far.

I ran several dozen Dell blade enclosures fully maxed out - well over 300 server blades - and in 3 years I had two disk failures, none of which were critical. Hardware is pretty reliable these days.

How do you monitor HW and network failures, and how do you notify SoftLayer? Is that 1-2 hour replacement time true for each component of your server fleet?

1-2 hours is their new-server provisioning time. For HW issues we use Nagios (which checks RAID health and ECC memory health regularly), and at the moment we just file a ticket with SL about the issue, showing them the output from our monitoring. They react within an hour, and HW replacement is usually performed within a few hours after that (usually limited by our ability to quickly move our load away from a box to let them work on it).
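A minimal sketch of such a RAID health check (assuming Linux software RAID, where a `_` in `/proc/mdstat`'s status block marks a failed member disk; hardware RAID needs the vendor's tool instead, and the function name here is made up):

```shell
check_md() {
    # $1: a file in /proc/mdstat format. Nagios plugin convention:
    # return 0 = OK, 2 = CRITICAL.
    if grep -q '_' "$1"; then
        echo "CRITICAL: degraded RAID array"
        return 2
    fi
    echo "OK: all arrays clean"
    return 0
}
# In the Nagios command definition you would call: check_md /proc/mdstat
```

Nagios schedules the check on an interval and alerts on any non-zero return code, which is what turns a dead disk into a ticket instead of an outage.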

HW failures are rare. At least hardware failures that matter. Disks in a RAID set dying or redundant power supply failures are not critical, and even those are more rare than you would generally expect them to be. With a bit of standardization it's incredibly cheap to keep a pool of spares handy and RMA the failed components at a leisurely pace.

Plus, you're still engineering your applications to be just as fault-tolerant as if they were running in the cloud, right? The only difference is you are not paying the virtualization overhead tax. A single server dying should leave you in a no less redundant state than a single VM dying. They should also be nearly as easily deployable.

This is based off my personal experience in datacenters with 5,000-10,000 installed servers. Anything other than a PSU or HDD failure is exceedingly rare.

We have 100 physical servers and hardware failures really are very rare. Very rare.

In fact over 4 years we have only had 3 hard drives fail and no other hardware failures.

Do you have any plans on replacing your hard drives as they get old? Or you just wait for them to fail?

Generally I'd say the extra cost savings come from the lack of software needed to support the Amazon-style APIs. They may have also made a multi-year commitment, which would further drive the cost down.

Why would 10,000 servers cost more than 100 servers? It seems like if you are buying all of your parts in bulk, they are going to be cheaper. I'm pretty sure Intel's pricing on CPUs is cheaper by the tray than individually.

I know at least when I've bought 20-30 servers at a time, I was able to get a lower cost than when I've only been buying one.

Prices go down up to a certain volume, then start to increase. Analogy: you want to buy shares of company X. If you buy 1, brokerage costs are high compared to your investment. If you buy 100, you still get the shares from the top of the order book, and fees become negligible. Buy 10M, and you will pay much, much more per share because the supply is not going to be there.

Just think of the simple supply-demand curve. As demand increases, price increases as well. The bulk discount pricing is only valid for amounts that provide better utilization of the supply chain. If Intel can produce 1M chips a month, then if somebody orders the last 50k, he might get a discount. If someone wants 2M, then he needs to pay a huge markup because the supply chain is not ready.

And Amazon is definitely big enough to move the equilibrium price up.

I see no explanation for why suppliers wouldn't match demand. Aside from the HDD shortage which hit everyone, I've seen no issues like you're describing where essentially the market runs out of servers/CPUs/etc.

Because that's the only way this theory applies, if they're completely unable to meet the demand due to some specific shortage in the market.

The only reason your shares analogy makes sense is because there ARE a finite number of shares available at any given time, and buying too many drives up the cost in the entire market. Most manufacturers can scale up production as demand increases.

There are a lot of reasons why suppliers might not match demand.

Let's say you have a factory that runs at 90% utilization and somebody crawls out of the woodwork who wants to order 3 factory-months worth of widgets, delivered next week.

Well first of all, you cannot meet that schedule, so you turn away the order in the instant case.

Now the question is: if we were to scale up production, what is the chance that some new person will crawl out of the woodwork with a similar instant order once the factories are ready? Because if we judge what has happened to be a one-off case, then we will refuse to meet the demand, whether it is real or not.

(Of course we're also making a lot of simplifying assumptions here like that you have access to capital, that there is no regulatory issue with scaling up production, that increased production does not open you to new lines of attack from your competitors, etc. Which are not good assumptions in general.)

It is our judgment of the demand, rather than the real demand, that controls production. If we are manufacturing, say, kevlar vests in 2001, we may very well interpret a large order as representing an underlying demand shift. On the other hand, if our widgets are luxury cars in 2008, we may interpret a similar set of facts as a one-off order.

The insight here is that real demand is not known at the time that supply is trying to meet it; it is estimated. The extent to which the market clears depends on how good the estimation is. With something like oil we understand demand fairly well, but in markets like consumer electronics the demand predictions are poor. That is why on the one hand Apple is chronically short of iPhones and simultaneously Amazon cannot give its phones away: all the estimates were off.

In short, the more your widget is impacted by technological or cultural shocks, the more likely it is that suppliers won't adjust to meet demand.

No, they can't scale supply that quickly. And time matters. Also, there's almost no situation where demand can't be met (it happens extremely rarely, in the case of inelastic goods). What happens is you increase prices until demand falls to match supply. You never run out of products. Similarly, oil supplies will last forever; we won't ever run out of fossil fuels. It simply won't be worth using them because of the price.

Really, microeconomics 101, all of us should have studied this in the first semester of any engineering degree :)

> Really, microeconomics 101, all of us should have studied this in the first semester of any engineering degree :)

That seems like quite a snide remark.

I have in fact studied "economics 101," and while you're using basic economic theory to form your opinions, you're mixing that in with data which you've just created for the sake of supporting your original point.

Essentially you have no supportable reason to assume that supply cannot meet demand OR that Amazon cannot space out their demand / pre-warn the supply chain. Amazon could, for all we know, give them 12 months of lead time.

To be honest this entire conversation reminds me of that scene in Good Will Hunting when the guy in the bar is mouthing off about "market economy in the early colonies" because he just finished studying them last semester. Reading your posts comes across like you're trying to shoehorn in as much eco 101 knowledge as you can. And rather than provide data or any meaningful explanation for why you believe the market would go a certain way, you just shove in more econ 101 theory and hope for the best.

This post in particular lacks any substance, and is just trying to impress upon us how much econ 101 you know. But really I am more interested in why you believe the market wouldn't meet demand, rather than how many buzz words and theory names you can reproduce from your textbook.

You make it sound as if Amazon buying servers was some kind of unexpected freak event.

They are just one of many large companies who buy hardware constantly.

Google, Microsoft, Facebook, Rackspace, Leaseweb, to name a few others...

Since most vendors source their parts from two or three different places you'll often find that even though you ordered 2000 'identical' computers, they'll have for example two or three different makes of hard drives in them, and sometimes different Bios versions and RAM configurations (2x8GB instead of 4x4GB for example)

If you need 10,000 identical servers (i.e. exactly the same firmware versions, motherboards, hard drive versions, etc.) then that is a bit of a pain, since they can't just grab the next 10,000 servers out of inventory and ship them to you. You have to place it as a separate special order.

> Amazon doesn't really get a break on the hardware cost because 10,000 servers do not cost less per server than 100 servers (in fact they cost more as the volume goes up if you need them to be identical).

Two problems here:

First off, 10,000 servers almost certainly cost less than 100. Not least because you can buy direct from the OEM rather than through a reseller (who takes a profit), and also because the buyer has more leverage in negotiations (that's a lot of money, and they COULD go elsewhere).

Second problem: The servers don't need to be identical, and in fact Amazon's EC2 instances aren't identical (they just pretend to be). If you spin up several EC2 instances over a few weeks then look at e.g. the CPU info, you'll see that they vary quite a lot but are similar-ish (this has caused people issues when they're using on-demand instances and their software relies on specific CPU features, in particular when those features only exist on current-gen CPUs).

PS - Also 10,000 is not even in the ballpark of how many physical servers Amazon has (try 450,000).
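The CPU drift point is easy to verify yourself by diffing /proc/cpuinfo across instances. A minimal sketch (the two model strings below are examples of the kind of generational variation you might see, not data from any particular fleet):

```python
def cpu_model(cpuinfo_text):
    """Extract the CPU model string from /proc/cpuinfo contents."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("model name"):
            return line.split(":", 1)[1].strip()
    return None

# Two instances of the same type launched weeks apart can report different
# CPUs (and different feature flags, e.g. no AVX2 on the older one):
older = "model name\t: Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz\nflags\t\t: sse4_2 avx\n"
newer = "model name\t: Intel(R) Xeon(R) CPU E5-2666 v3 @ 2.90GHz\nflags\t\t: sse4_2 avx avx2\n"
print(cpu_model(older) == cpu_model(newer))  # False
```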

> When it comes to labor cost - if you have enough hardware for at least one full time datacenter tech, you're in the same boat as Amazon.

I highly doubt that. Amazon's scale allows them to develop better automation, detection, and procedures in general, which allows the number of staff per server to be very low. For example, a single dedicated tech might be able to handle 10-30 servers MAYBE, whereas at Amazon that might be just a single rack, and effectively each tech might be responsible for hundreds of physical machines (even if automation does the lion's share of the heavy lifting).

> So you're paying Amazon to do the same work you would do otherwise - only you're subject to their rules and procedures and Amazon being a profitable business needs to mark their services up.

I will fully admit that a company like SoftLayer (per the article) can give Amazon's EC2 a run for its money. However as someone who's seen the costs associated with running servers in house (in particular staffing costs) I struggle to buy that you can under-cut Amazon by doing so (at least until you have a LOT of servers, and even then frankly it is less hassle to out-source it anyway).

There are legitimate arguments for why you'd want to do so e.g. privacy, security, legal reasons, unique hardware/OS, etc. However if you're just doing something generic like web-host+database, then out-sourcing it to a dedicated company is more cost effective. In particular when you start looking at the hidden costs of internal hosting (like office space, heating/electricity, security, and so on).

This is not entirely true. Certain components' pricing decreases as volume goes up, but this is likely at a different scale than SoftLayer's.

Largely, the way to use Amazon efficiently is to turn off nodes when they're not needed for traffic. That is the service you are paying for.

Yeah, and you pay McDonald's to cook the burger for you -- but you can't do it cheaper than they can.

A concrete example of how I saved a few hundred thousand dollars over the AWS by building quarter rack colocation setups with SuperMicro servers: https://gist.github.com/rockymadden/5561377

With Ansible, I spend no more than an hour a week, amortized, maintaining both the hardware and administration. I assume nodes for any specific role will fail, I only scale horizontally, I always have redundancy for every role, I stay off disk as long as possible (heya 512GB RAM redis cluster), etc.

Great resource. Are you going to keep this updated?

From my experience AWS is one of the most reliable hosting providers. It's extremely easy to setup a fault tolerant infrastructure using cloudformation, puppet/chef etc, boto and a bit of autoscaling. The only disadvantage AWS has is costs. In many cases it's more expensive than a traditional hosting provider. On the flip side, your engineers will be more productive.

What I miss in this article is any details on why they had issues with AWS. You can't just say it wasn't reliable and not explain the details. AWS works for all of the world's largest startups, why didn't it work for Swiftype?

Author here. Happy to answer any questions.

Can you give more specifics about what you were running on and what you purchased for your own gear?

I run an environment that scales to around 1,000 EC2 instances daily. Primarily we run C3.2Xlarge and R3.2xlarge for the core of our application.

We have ~12 nodes in our mongo cluster, and haven't had a single issue with these nodes.

I occasionally get a zombie (totally hung VM) but that's very infrequent. I was aggressively using spot instances previously, but have switched to all 12-month reservations (We would lose many machines to a spot outage, new machines - more than those on Richess) and the recovery time for our system is 35 minutes (due to the R3 boxes needing to download their in-memory index from other machines) - so our service is degraded in capacity until the relaunch of these machines completes.

[aside: if you're looking to use spot, do two things - over-provision by a factor of 1.8 and spread across zones, and go look into ClusterK.com for their balancer product]

Anyway, Just curious what was causing "sometimes daily" outages - I can't imagine that this would be due to AWS and not lacking ability of your application to handle instance losses.
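The over-provisioning aside above amounts to simple arithmetic; a sketch, with the 1.8 factor and the 3-zone spread taken as stated assumptions:

```python
def spot_fleet_plan(needed, factor=1.8, zones=3):
    """Back-of-the-envelope spot sizing: over-provision by `factor`,
    spread evenly across `zones`, and check that losing any single
    zone (a common spot-outage pattern) still covers `needed`
    capacity. Numbers are illustrative, not a recommendation."""
    per_zone = -(-needed * factor // zones)  # ceiling division
    total = per_zone * zones
    survives_zone_loss = total - per_zone >= needed
    return int(total), survives_zone_loss

print(spot_fleet_plan(100))  # (180, True): 60 per zone, 120 left after losing one
```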

We've been using EC2 here for nearly 2 years, and you mention I/O problems and instance outages 2-3 times a week. Which size instances were you running?

I ask because, other than the VM security updates, none of our instances have these sorts of issues, and some of them have a VERY long life (not ideal, we know). I understand the cost savings and the rest of the reasoning, but in my experience EC2 isn't THAT unreliable.

Oh, I know what you're talking about. We too had some instances (actually, a lot of those) that would run for a year with no issues. The problems started around the time you tried to push EC2 instances beyond an "idle, handling some requests just to keep from falling asleep" state. Pushing IO (even with provisioned IOPS) caused random IO stalls, pushing CPU caused REALLY uneven performance, etc.

And the only solution provided by EC2 support was always to buy more instances to keep them cold and happy. The problems with that approach (just to name a few): the cost (for a young startup, burning money on idle infrastructure like that is not very wise IMO) and the fact that the time spent designing, developing and deploying a scale-out approach for each of your backend services is time you could have spent building your product (again, startup-specific; you'll have to think about across-the-board 100% scalability at some point).

First: I work in Startup BD at AWS (disclosure), but have been a multi-time founder as well. I was under the impression that an AWS architect will sit with you to optimize your infrastructure (Business Support). Did that not happen / or was it not useful? Happy to help in any way I can.

What size instances were you using in EC2 that were having performance degradation, and what kind of specs did the real hardware have that you moved to?

If you can provide more details on the kind of EC2 instances you were using and the kind of issues that you consider issues that would be great.

In my limited experience, I've rarely faced any significant issues with EC2. I assume your threshold for issues must have been much lower than mine.

How do you handle spikes in request volume? That is, one of the nice things about working in the cloud with dynamic sizing is that your costs should only be relative to your average load, not necessarily your peak load. Given the size of Swiftype, and the fact that you back tons of individual sites (instead of being your own site), there might be enough variance in the sites you back so that your peak and average load is pretty similar. For a single ecommerce site, though, they might get a huge peak over baseline if they do a big marketing push, for example. In that case, they might have to provision many more physical servers than they usually need to handle that peak. Just wanted to see if this issue came up in your planning.

Honestly, just as with many SaaS companies at some scale, we do not need to care about any specific customer's traffic anymore. Simply because we already get so much traffic from our existing customers that none of the new customers could generate enough to cause a significant blip on the radar. If a customer comes to us with some specific requirements (like being able to index 100MM documents with specific response time guarantees), we build dedicated pieces of infrastructure for them, load-test it all and provide those guarantees. All of the others are placed in their own pools, which have enough capacity to handle 3-5x of ALL of our current traffic with no issues, so no single customer would be able to generate enough load to cause problems.

And, as I mentioned in the article, we could always order new boxes for any of our clusters and get them online within a couple hours, so we are able to scale up pretty quickly if needed.

Did you reserve your instances? Were you using current generation instances (c3, m3, etc). Did you try to take advantage of traffic patterns to scale up and down the number of instances you were running?

We had reserved instances and regular ones; we did not see any patterns in stability issues between those. Re: instance types - I do not really remember which instances we were using, to be honest. And as for scaling up and down - we had a hard time keeping it all up as it was, and we did not want to spend resources trying to make it work with constantly changing node pools (though I understand that it would have pushed us to building a more robust infrastructure able to handle random node outages, we had a business to build and wanted to focus on the product instead of creating a perfectly scalable infrastructure at such an early stage).

Well I ask because reserved instances can significantly reduce the price of ec2 (up to something like 75%). Also just turning off idle instances can save a ton. If you invest the time in doing those things I think you can beat other options and so in that way, cloud infrastructure can be very economical.

It is like many other things involved in running a technology company. Investing in automation can pay off hugely.

The newer instance types are very reliable too (in my experience).
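As a rough illustration of the reserved-instance savings being discussed (the hourly rate and the effective discount below are assumed, 2015-era ballpark figures, not quoted prices):

```python
def monthly_cost(hourly_rate, hours=730, discount=0.0):
    """Approximate monthly cost of one instance (730 hours/month)."""
    return hourly_rate * hours * (1 - discount)

# Assumed numbers: ~$0.42/hr on-demand, 60% effective discount from a
# long-term reservation (the "up to 75%" applies to the heaviest commitments).
on_demand = monthly_cost(0.42)
reserved = monthly_cost(0.42, discount=0.60)
print(round(on_demand), round(reserved))  # 307 123
```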

If you were using old generation instances (t1 m1 c1 etc.) your experience would have been VERY different to current generation instances (c3 c4 m3 t2). Did you try network optimized instances with (very) low latency networking?

Did you try dedicated instances?

Other than EC2, what AWS services were you using and how did you migrate those? At minimum, I'd guess you were using ELB for load balancing, SQS for queues and ElastiCache for Redis.

We are still a loyal customer for some of their services. For example, we still use S3 for off-site backups and Route53 is still our primary DNS provider.

For load balancing we have moved to a Route53 (health checks and round-robin) + a group of nginx+haproxy+lua-based frontend boxes.

Everything else was either built in-house or used open-source components and wasn't really tied to EC2 infrastructure.

Did you seriously consider a traditional colo or other vendors like SoftLayer (e.g., Rackspace)? It seems like at some point your reasoning here will apply to a colo if you grow bigger.

Colo - that's an option I'll try to stay away from as long as it is humanly possible. All of my experiences with colo hardware caused a lifetime of pain so that I'm happy to be paying SL a premium for their outstanding services (I'm a huge fan of Softlayer as you have probably guessed).

Re: Rackspace and other providers – based on my real-life practical experience with a few of the largest providers in the States, SL's quality of service and provisioning speed are miles ahead of what competitors could offer. So it was a no-brainer to go with SL and I'm happy we did.

I'm convinced hybrid cloud is the way to go. Anything needing high IO performance should be on dedicated. Anything needing CPU/memory elasticity (worker nodes, etc) should be in the cloud. Assuming you can get low latency connectivity into AWS with DirectConnect, this might work?

The speed of light is not your friend. You can never compete with the latency of a local network versus even the fastest connection to a cloud provider.

Yeah, that (worker nodes, async processing, etc) seems like one of those ideal use-case for clouds that I could definitely support as a viable option for companies that aren't ready for all-in cloud deployment.

CPU performance also generally suffers due to cache thrashing on non-trivial datasets.

Heh. So what is a good use case for EC2?

When you have highly variable load.

eg. Netflix probably spins up thousands of servers for a few hours.

What specific piece saw the biggest boost? My guess is MongoDB. Also if new servers take 1-2 hours, you are always paying for what you "think" will be capacity for peak load correct? How do you handle events that quickly and drastically increase load or txn/sec?

Actually, it depends. Stability and performance wise, I'd say our Lucene-based search layer has seen the most impressive jump. But yeah, Mongo instances loved the new fast IO as well :-)

The biggest problem with AWS is the outrageous cost of bandwidth. Even if you ignore all the other cost differentials, the bandwidth charges will kill you at scale.

Unfortunately cloud computing, or at the very least AWS, overpromises and underdelivers at scale. All in all economically viable use cases for cloud computing are very few and very specific at scale.

Since people are sharing their experiences in this thread, has anyone tried using Direct Connect to get cheaper bandwidth?

There aren't a lot of savings to be had. In addition to paying for the Direct Connect infrastructure you still have to pay Amazon per GB charges, albeit at a slightly lesser rate.

The only real solution is to move the bandwidth usage off AWS.
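To put rough numbers on the bandwidth point (every figure below — the AWS per-GB rate, the bundled allowance, the overage rate — is an assumption for illustration, not quoted pricing):

```python
def aws_egress_cost(gb, rate_per_gb=0.09):
    """Flat-rate approximation of AWS data-transfer-out charges
    (real pricing is tiered; ~$0.09/GB was the 2015-era first tier)."""
    return gb * rate_per_gb

def dedicated_overage_cost(gb, included_tb=20, overage_per_gb=0.05):
    """Dedicated hosts typically bundle several TB per server;
    you pay only for overage beyond the allowance."""
    over_gb = max(0, gb - included_tb * 1000)
    return over_gb * overage_per_gb

gb_per_month = 50_000  # 50 TB/month
print(aws_egress_cost(gb_per_month))         # 4500.0
print(dedicated_overage_cost(gb_per_month))  # 1500.0
```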

One question. If real hardware was always 50% cheaper then AWS wouldn't have been such a success. Can you please explain in which scenarios it makes economical sense to use AWS compared to real hardware?

In my personal opinion (based on some real experience), EC2 for a startup makes a lot of sense for prototyping your application and your infrastructure (when you don't really know what it is you're building and what components you're building it from). At this stage you just use it as an easy way to get a set of Linux computers connected to the Internet.

When you get to a point where you feel like this whole thing is going to fly, I'd recommend starting to think about whether paying the "cloud tax" (resources spent working around EC2 stability issues and other cloud-specific problems) is a good idea in your particular case. There are some companies that benefit greatly from the elasticity of the cloud (the ability to scale up and down along with their specific load demands), but many companies aren't like that. If your traffic is relatively stable and predictable (you do not have 10-100x traffic surges) and your infrastructure load does not grow linearly with traffic, using real hardware over-provisioned to handle 2-5-10x traffic spikes without a huge decrease in performance may be a better idea in terms of cost.

Of course, you could start the company based on all of the PaaS magic sauce (databases, queues, caches, etc) provided by Amazon nowadays and only use EC2 to run your application code (AFAIU that's the ideal use-case for AWS) and just kill misbehaving nodes when an issue occurs, but then you need to factor AWS costs into your business plan because migrating away from a PaaS is almost impossible at any large scale, so you are going to stay with Amazon for a very long time.

Great point. Also, on AWS if you use Amazon Linux you are even more 'locked-in'. Maybe that's the reason many hosting companies give you discounts at the beginning.

Pretty sure Amazon Linux is just their particular flavor of RHEL, right?

Correct. It doesn't have the exact same default configuration as, say, CentOS, but it's not much different otherwise.

in which scenarios it makes economical sense to use AWS compared to real hardware?

There are three types of workloads that make sense to run on EC2:

a) Extremely spiky/seasonal loads (batch jobs, event/campaign traffic)

b) Loads that can be structured as to run entirely from spot-instances (worker-pools)

c) Loads so small that the markup versus rented/dedicated hardware just doesn't matter

Yes, completely agree.

Maybe I'd just add one more case here: some users are OK with locking themselves into AWS by treating it as a platform from day one and building on top of AWS database/queue/etc. services. For those people, using EC2 just to run the app code and replacing instances when they misbehave may be a good idea.

This really is the best scenario for AWS (and Azure is also heading this direction somewhat). We spend pennies on the dollar in utilization fees on their services vs what it would cost to implement, maintain, support and scale equivalent services on top of compute.

And to your point somewhere else in here, it is a hell of a thing to try and move away from that platform. Yeah, it's super easy to beat EC2 on cost of compute resources, but really if you're running everything yourself on top of compute at AWS then you're doing it wrong.

Thanks for the answer. Maybe I need to understand better the advantages and disadvantages of cloud.

Really you'd want to compare EC2 and real hardware, not AWS and real hardware. AWS as a total package comes with a great many services, and if you're using more than a few of them then it can require a great deal of engineering time to set up replacements.

A lot of AWS services can be used by real hardware though, so it's not all or nothing.

For example, where I work we use S3 to store an archive of files but keep the working set of data cached on our web servers which are at codero.

We have video rendering servers which turned out to be much cheaper to do with a cluster of desktop-class hardware in a server closet at our office as opposed to the server grade GPU instances on EC2. The monthly cost of a single GPU instance at EC2 is more than the total cost of the hardware off of newegg.

However, for outages we have a script that spins up GPU instances on EC2 which is much more economical than having a separate set of servers somewhere just in case.
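A rough break-even for the GPU example above (the hourly rate and hardware cost are assumptions in the ballpark of 2015 pricing, not quotes):

```python
# Assumed: ~$0.65/hr for an EC2 GPU instance on-demand, ~$1,500 for a
# desktop-class render box bought outright (ignoring power and space).
instance_hourly = 0.65
box_cost = 1500

hours_to_break_even = box_cost / instance_hourly
months_24x7 = hours_to_break_even / 730  # 730 hours per month
print(round(hours_to_break_even), round(months_24x7, 1))  # 2308 3.2
```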

Yep, that is why in the article I specifically pointed out that we have migrated from EC2, but we are still a loyal customer of some of AWS services and those work really nice for us.

Great point, seems like after all the biggest issue with AWS is the inconsistent performance. But if you use S3 to store files this is not an issue...

Real hardware has always generally been cheaper.

AWS is a success because there are no upfront costs, it lets you scale up very quickly, and you don't need in-house hardware expertise to maintain your machines. People are willing to pay a premium for these advantages.

Multiple reasons:

1) you can't get new hardware delivered and get it up and running, all in under 10 minutes.

2) also, assuming the previously gathered hardware is not needed anymore, you can't just return it and say "I used it only for two days because I had a traffic spike; take this $200 and we're okay."

3) you can't programmatically install, configure, reinstall and reconfigure hardware, networking and services on physical servers. At least, not as easily.

Many others, but these are very valid points.

Of course, Amazon is not the solution to all of the problems you could ever have, but still it solves a great deal of problems.

1) ok got me there (but SoftLayer does offer VMs too, which presumably have a faster turn up time)

2) SL has hourly physical server rental now (turn up is quoted at 20-30 minutes though)

3) SL has an API for ordering changes, you can setup a script to run on first boot (and probably system images too). What are you thinking for network configuration? Really the only thing I've had to configure on SL is port speed (somewhat API accessible, but not if they need to drop in a 10G card/put you on a 10G rack, etc), and disabling the private ports (API accessible, real time changes).

Well for me personally it makes sense if you are not at a very large scale. Real hardware is cheaper, but the costs of electricity, transit, and colo space are not your only costs. If you are going to do physical hosting right you need to be in several locations, and you need to have a very good idea of what assets you will need for a period of time.

With AWS I can scale up easy, not have to worry about doing things like replacing failed hard disks, and most importantly I can be in multiple geographic sites for no additional cost. That to me right now is worth a 50% premium as the cost for doing that would be higher than that savings.

I think if you reach a certain scale, and have predictable usage, it is not a bad thing to set up cabinets in 2 or 3 locations. We have found too that a lot of colos are getting bought up and then will not lease you a few cabinets. They want to sell only to people who want a cage or an entire room. It is hard for small to medium sized businesses.

With Softlayer I get multiple locations (with transparently routed backend connectivity), 1-2 hour provisioning speed, and I do not need to worry about networking and hardware. Failed drives (when they fail, which happens rarely since the hardware they use is REALLY good) are replaced within an hour.

Colocation is a very different beast and I certainly would not encourage anybody to do that until a very large scale when rented hardware economics stop working for them.

That's not true. AWS is known to be more expensive in the long run. AWS was never about being cheaper for a mature company. It's popular because it's cheap/easy to get started, it's elastic, and it allows CTOs the freedom to balance capex vs. opex.

AWS is always going to include a premium because they take care of the DevOps portion of your infrastructure. There are plenty of virtual hosting companies that cost significantly less than dedicated hardware, if you won't need all the bells and whistles.

because they take care of the DevOps portion of your infrastructure

Sorry, but that is mostly a lie.

Running a non-trivial app on EC2 is significantly more complex than doing the same on (rented) bare metal. Scaling to a massive size can be easier on EC2, but only after you paid a significant upfront cost in terms of dollars and development complexity.

Is your app prepared to deal with spontaneous instance hangs, (drastic) temporary instance slowdowns, sudden instance or network failures?

Did you know that ELBs can only scale up by a certain, sparsely documented amount per hour?

Or that you need a process to deal with "zombie" instances that get stuck while being added to/removed from ELBs (e.g. the health-check never succeeds).

Or that average uptime (between forced reboots) for EC2 instances is measured in months, for physical servers in years?

Or that Autoscaling Groups with Spot instances can run out of instances even if your bid amount is higher than the current price in all but one of the availability zones that it spans?

The list of counter-intuitive gotchas grows very long very quickly once you move an EC2 app to production.

This comment is pure gold! That's exactly what I wanted to explain here and you did it so well. Thanks!

> AWS is always going to include a premium because they take care of the DevOps portion of your infrastructure.

This is a surprise to me, given that I work at an AWS shop doing things other people would call "DevOps". AWS doesn't automate provisioning or provide a (worthwhile) deployment pipeline, and AWS doesn't react (except in crude and fairly stupid ways) when something goes wrong or out-of-band.

> AWS is always going to include a premium because they take care of the DevOps portion of your infrastructure.

No, they don't. They provide the tools; it's still up to you to orchestrate it.

> AWS is always going to include a premium because they take care of the DevOps portion of your infrastructure.

But wouldn't that apply also to SoftLayer?

No, there is significantly more complexity, overhead, and R&D to providing cloud services in comparison to bare metal. SoftLayer is actually a very expensive bare metal server provider. There are several good options that cost less than 1/3rd the price. Realistically, at just modest scale (a few physical servers), you should see 1/6th the cost of Amazon.

The main benefits of Amazon are that it: a) allows you to scale down, i.e. buy services in smaller portions than complete physical servers, b) offers APIs, and c) has integrated features.

You could probably pay for one devops position once your infrastructure gets to 10 physical servers.

Could you please list some competitors to SoftLayer? It's hard to get reliable opinions on cloud providers backed up with actual experience. I'd really appreciate it!

There's a plethora of different bare metal / dedicated server hosting providers, so recommending one is like recommending what type of car you should buy. The most important criteria generally involve: location, managed vs unmanaged, quality vs price, class of hardware, network uptime and hardware replacement SLA, number of servers, and smaller vs larger provider.

The best resource to research different providers are the webhostingtalk.com forums. You can also contact me and I will do my best to advise you based on your desired criteria.

*Full Disclosure: I'm the founder/owner of a dedicated hosting company

My personal experience within the last 7 years:

* Softlayer – the best option in terms of quality of service, quality of hardware, quality and size of the infrastructure (geo distribution, etc).

* Rackspace – nice, until you grow enough to get relatively locked in; then your prices start to go up, provisioning time suffers, and their service turns into shit.

* Steadfast – provisioning times up to a week, basically a joke in today's world.

* Some German/EU providers like Hetzner – dirt cheap option with desktop-like hardware, failing quickly. Service is nowhere near SL level.

I could go on and on about those, but other options were even more painful.

Re: Hetzner - they do offer some cheap "desktop" grade servers but they also offer lots of real servers: https://www.hetzner.de/ot/hosting/produktmatrix/rootserver-p... (e.g. 120GB RAM, Xeon chips, SAS drives, hardware raid etc.)

I've been running three 32GB servers (each with 3TB storage) with them for 2+ years now and the only outage I've experienced is the switch (5 port GBit) dying once. Hetzner tech replaced it in under an hour.

These three servers cost me €263/month (that's total, not each). Included in that monthly price is an additional IPv4 for each server, a private 5port Gbit switch, remote console access and 300GB of DC backup space.

There are probably better deals available now (i.e. more RAM at the same price) than the one I'm on since it's old and not offered on their site any more (/makes note to self to call Hetzner sales)

Hetzner and OVH both use cheap hardware, but IME it's not hard to use the same primitives you'd otherwise use in something like AWS to build in redundancy at a price point well below AWS.

I wouldn't want to do it, which is why I'd rather work for somebody who'd pay for AWS, but I think there's a thing in there somewhere for those who want to dig.

I think cloud providers are mostly good for small/medium, fast growing startups. Big companies need more granular flexibility. The path to vertical integration in the software industry is a lot smoother than in many other industries; there are many levels:

You can...

1. Make your own hardware

2. Own/manage your own hardware

3. Rent commodity hardware from a standard hosting provider

4. Use IaaS (e.g. EC2)

5. Use PaaS (e.g. AppFog, Nodejitsu)

6. Use BaaS (e.g. Firebase, PubNub, Pusher.com)

The higher the level, the more technical flexibility you lose. The bigger the company, the more it makes sense to operate at a lower level, because there is no significant wastage being introduced as you move down the levels (you remove the middlemen so you can pocket their profits) and the capital cost to move between levels is relatively low.

If you compare software to another industry like cheesemaking, for example: if you're a cheesemaker and you want to make your own milk, the next step is to buy the whole farm, and then you have to figure out what to do with the meat (wastage). Going between these two levels is expensive and could mean doubling or tripling your expenses, so it's not an easy move to make.

Cost-wise, AWS makes a lot of sense when growth is not easy to project, happens rapidly, or varies (seasonally, randomly, etc).

Of course, some of this has to do with your team's skill level, but I've had clients run up $100k+ monthly bills at AWS with a relatively small build-out. (and, wow, VPC migrations..)

For fixed or predictable growth patterns on a mature app/platform, a slow build-out on real iron will generally be significantly less expensive, all other things being equal.

However, there are other advantages to AWS that get lost in this story, such as pre-built, highly scalable datastores. Comparing EC2 to real iron misses most of the real story on why the cloud is changing everything.

One of the hardest things I have to tell clients is not to build their own datastore/database in house or at EC2; sometimes the case is clearcut, and sometimes not so much, but if you have a datastore at AWS that gives you 80% of what you need, use it instead of rolling your own. (source: IAMA AWS Growth Architect)

It strikes me that more and more a critical selection when growing in this way is the DNS part. It needs to be back-end agnostic and provide an increasing amount of functionality.

Health checks and failover are must have now, but this article makes me wonder three things:

1) Are there any DNS services that understand geography of your "zones", i.e. route to and failover based on IP? (but are still platform agnostic).

2) How long can a DNS failover take worst case? You can technically set a low TTL, but don't a lot of ISPs just increase that to a minimum?

3) Isn't it better to replace some of the DNS failover with high availability dedicated load balancing?

1) Yes, there are several DNS service providers that offer BGP anycast with geographically aware failover / load balancing. UltraDNS and DYN are the larger ones.

2) Yes, some ISPs do set a minimum TTL. Although BGP anycast is the most effective first line, sometimes it makes sense to have your reverse proxy caching layer override that distribution based on GeoIP and redirect to a more suitable proxy node closer to the client. This is especially the case when people use recursive DNS servers that aren't necessarily geographically close to them. It can also be useful in cases where TTL expiration hasn't caught up yet.
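The redirect decision in that GeoIP override is tiny. A sketch, with entirely hypothetical node names and region keys (a real setup would make this call inside the reverse proxy itself, e.g. via a GeoIP module):

```python
# Hypothetical mapping from GeoIP-derived region to the proxy node
# we'd rather have that client talking to.
PROXY_NODES = {
    "eu": "eu.proxy.example.com",
    "us": "us.proxy.example.com",
}
DEFAULT_NODE = "us.proxy.example.com"


def best_proxy(client_region, current_node):
    """Return the node to redirect the client to, or None if it already
    landed on the right one (so no redirect is issued)."""
    target = PROXY_NODES.get(client_region, DEFAULT_NODE)
    return None if target == current_node else target
```

The subtlety is entirely in the GeoIP database quality and in deciding when a redirect is worth the extra round trip, not in the lookup itself.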

3) No. Think of BGP Anycast DNS as distribution at a global level, and dedicated load balancers as distribution at the local level. You need to work out how to get the traffic to the load balancer first, and load balancing across distant geographies (high latency) results in horrible performance.

We use DNS-based load balancing along with an HA pair of load balancers in each datacenter. If the DNS health check fails, we stop sending traffic to the failing frontend LB. If the failing LB is dead, we move its IP to the other one.

DNS TTL is not as big of an issue today as it was 5-10 years ago, when idiotic ISPs were trying to save on DNS resolving by ignoring TTLs. Nowadays you see an almost perfect drop in traffic when switching off a load balancer. Only bots and some weird exotic ISPs may keep sending traffic to a disabled box for up to an hour or two, but since DNS LB is only used to handle real emergency outages and for planned maintenance we could move LB IPs around, I really do not see it as a big enough issue to stop using the DNS LB magic :-)
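The record-selection half of that scheme fits in a few lines. This is a hypothetical pure function, not any DNS provider's actual API; the one design choice worth calling out is failing open, so a broken health checker can't blackhole the whole site:

```python
def records_to_publish(frontends):
    """Pick which frontend load-balancer IPs to serve in DNS.

    frontends: list of {"ip": str, "healthy": bool} dicts, one per LB,
    where "healthy" is the latest health-check result.
    Fails open: if every LB looks down, keep publishing all IPs rather
    than returning an empty record set.
    """
    healthy = [f["ip"] for f in frontends if f["healthy"]]
    return healthy or [f["ip"] for f in frontends]
```

A cron-style loop would run the health checks, call this, and push the result to the DNS provider whenever the record set changes.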

Check out DynECT from http://www.dyn.com

Twitter, Mozilla and lots of other big names use them. I remember watching a webcast where Mozilla said they used Dyn's anycast failover service, with TTLs on their domains set to 5 seconds.

I've been using their DynECT entry level package ($30/month) for a couple of years and it's great.

Edit: you might also find this comment from an old thread interesting/useful: https://news.ycombinator.com/item?id=7813589 (go up two levels to phil21's first comment - HN isn't giving me a direct link sadly)

Cases where people try to use EC2:

- We're starting from scratch and think AWS will give us flexibility for cheap

- We have existing servers and think moving from them to AWS will give us flexibility for cheap

Cases where people move away from EC2:

- It was slow/unreliable

- It was expensive

Conclusion: You should use AWS EC2 in order to save money and have flexible resource allocation, but don't expect it to be stable or cost-effective.

I used Rackspace (and later EC2) originally because of upfront capital costs. I bootstrapped my company, and paying a thousand dollars or more, plus hosting/racking fees, just wasn't doable. $50-100 bucks for a reasonable Windows server? Easy deal to make when cash flow is small.

It's about a 5-7 year cycle: IT services are outsourced (this time, to the cloud), but as time progresses, all of those services are consolidated (usually under new management) in-house.

I used to be able to set a watch by HP's cycle before the EDS buyout.

Finally, a sane take on cloud BS :)

I'm telling you, there's a real opportunity for someone to create "the Uber of colocation/dedicated hosting".

OpenStack was supposed to facilitate this. Doesn't seem to get much use.

It does from my perspective. I could list many companies that use it for major things.

It doesn't always get publicity. I work for a major company, and we don't scream from the roof tops about it.

Interesting article, thanks for sharing. We actually just went in the exact opposite direction because of the larger-scale issues we were having with Softlayer. Do you feel like you lost any resiliency by making the switch to physical servers (more virtual instances on one physical server, servers in the same rack, etc)?

No, I really do not think going to EC2 could be beneficial in any way in terms of improving resiliency compared to Softlayer. SL allows you to control which VLANs your box will end up on. VLANs could be treated as racks (since they do not allocate more than one VLAN per rack). Then you have multiple DCs in one region (e.g. DAL01, DAL05, DAL07, etc) and you have many different regions (DAL, SEA, WAS, AMS, etc).
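Treating VLANs as racks turns replica placement into a simple anti-affinity problem. A sketch of the idea (a hypothetical helper, not a SoftLayer API): cycle through VLANs round-robin so no two replicas share a VLAN until every VLAN has been used once:

```python
from itertools import cycle, islice


def spread_replicas(replica_count, hosts_by_vlan):
    """Place replicas on hosts, cycling through VLANs ("racks") so that
    replicas are spread across as many VLANs as possible.

    hosts_by_vlan: dict mapping VLAN id -> list of available hosts.
    Returns a list of (vlan, host) placements; may be shorter than
    replica_count if there aren't enough hosts in total.
    """
    pools = {vlan: list(hosts) for vlan, hosts in hosts_by_vlan.items()}
    placements = []
    # Bound the round-robin walk so exhausted VLANs can't loop forever.
    for vlan in islice(cycle(sorted(pools)), replica_count * len(pools)):
        if len(placements) == replica_count:
            break
        if pools[vlan]:
            placements.append((vlan, pools[vlan].pop(0)))
    return placements
```

With two VLANs and three replicas, the third replica reuses a VLAN only after both have one, which is exactly the rack-awareness you'd want from any quorum-based datastore.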

I'd be very interested what problems you were having with them and at what scale. If this is a private topic, we could do it over email or some other medium if you like. You can contact me by any of the means listed here: http://kovyrin.net/contact/

We were about 75% virtual with SL and 25% bare metal. One of our issues with the virtual stuff is that when we started dedicating instances to a set VLAN, multiple times we ran into an issue where some resource in the pod that the VLAN was in would be maxed out (usually storage), so we couldn't create a new instance. The solution we were given was to let the system pick a VLAN, but by doing that we lost control of the placement and added some complexity to our architecture.

Aside from that it was mainly nit-picky type stuff, but still things that were annoying (networking issues between DCs, networking issues between pods, internal mirrored apt-get repos going out of sync, API is kind of blah, etc).

We use docker so having a few bare metal machines with tons of containers on them wasn't a great HA setup (for us at least), even running in two data centers. The fairly quick setup time though was a nice selling point.

When we went to AWS things just kind of worked. The API was easier to use and the GUI portal was way nicer/stable. So far we have not had any odd issues with our instances, but we also typically run them at about 50% capacity so that might be why. It is also still early so maybe things will come up in 6+ months that send us back to SL :)

Oleksiy, can you share the economics? You said 50% savings. What was everything you had running in EC2 (easy to figure out the costs), what were your equivalents in softlayer? Would be very interesting to see the economics.

Oh bare metal reloaded :)

RunAbove from OVH seems to be the best of both worlds. You can rent dedicated hardware by the hour for very reasonable prices.

AFAIK Softlayer has hourly-based rental for real hardware as well. Never tried it, so I can't comment on it though.

In my experience, firewalls and networking are more expensive than servers...

Anyone offer competition to VPC?

Softlayer manages the network for us. They offer a fully isolated backend network (frontend connection to the world is optional). For frontend connections we simply use iptables.

"2-3 serious outages a week" with EC2?? Details, please.

Those weren't outages for a specific instance. But from the whole pool of instances we were running 2-3 would have networking issues, random unexplained hangs requiring an instance restart, huge CPU performance drops, IO hangs, etc, etc.

Thanks for the information
