Just to clarify, the auto-scaling was specifically for a pool of web application servers. At the time I gathered the numbers, there were 80 servers in that pool. In the last few months we've been moving toward a service-oriented architecture, and we've been able to use the same code to auto-scale the internal services. Of course it's not possible to auto-scale stateful servers like databases, but it's still saving us a considerable amount of money.
We implemented the auto-scaling in early 2012, so it's been in use for almost a year now. It only took about 2 weeks of engineering to build the system. It does need occasional maintenance, but it's still worth the effort given how much money it saves us.
That's not entirely true. :) It's just a lot harder to scale them elastically. At Netflix we've done some proof of concept work on elastically scaling Cassandra, although we don't have it in production yet. I think that is one of our goals for 2013.
But this is about saving money, not being a tenant of multiple providers, and not spending more than "2 weeks."
If we're talking about read-only, I don't see why a DB, in the traditional meaning, has much to do with the issue.
These are not necessarily AWS faults; they are certainly trying to help. It might be the year for being a tenant of more than one PaaS though, but that is only going to up your costs and still not give infinite 9s.
AWS certainly feels pretty costly when you compare colo prices to the list price for on-demand instances. But one of the reasons I wanted to present our work is to show that you can use the cloud for a lot less than the list price. It takes work to buy reserved instances or run spot instances, but that does make it much more cost competitive.
By "going colo" they would need to lay out all the upfront hardware costs, which would not be insignificant. They would then have all the operational overhead of maintaining that gear.
You may have a higher monthly cost for AWS services - but without needing to buy any physical hardware, you have far less to worry about. Further - if you need to scale, it can be done in seconds rather than weeks or months, given the lead times for procurement, design, and implementation (i.e. scheduling the install in the colo, coordinating the need for more space, etc.).
This is just scratching the surface...
Leasing still ends up substantially cheaper than EC2 - I've never paid upfront for any colocated hardware I've been responsible for. So does managed hosting at a number of providers. Last time I priced this out, EC2 ended up 2-3 times as expensive as managed hosting (with no upfront costs for the managed hosting either), and the gap to leasing servers and putting them in a colo was even larger (though there you do need some scale before you cover the extra ops costs).
> They would then have all the operational overhead of maintaining all that gear.
If you're small enough, sure, your savings won't pay for extra ops people. But you don't need to be very large before the savings outweigh the cost of more ops people. And with managed hosting this is a non-issue - at that point you don't have any more ops issues than you have with EC2.
> Further - if you need to scale, it can be done in seconds rather than weeks/months given lead times for procurement, design time, implementation time (i.e. scheduling the install in the colo, coordinating the need for more space etc...)
It's not either/or. In fact, being prepared to use EC2 to handle peaks means the cost difference between self/colo hosted (+occasional EC2 use for peaks) and EC2 gets even larger, as you can run your own servers at far closer to full capacity without the risk you'd take if you didn't have that ability. Handling occasional peaks with EC2 is a great use of it, and definitively cost effective.
See above. But also consider that, comparing against the managed hosting option instead, a number of providers will auto-provision in minutes to a couple of hours once an order is placed. Many providers now also offer a mix of colo, managed hosting and EC2-like cloud solutions, so if you want to deal with a single provider you can put your base load in a rented rack, scale in the mid term via managed hosting, and spin up cloud instances as needed.
EC2 is great for "quick and dirty" temporary solutions, batch jobs or handling peaks that last less than about 6-8 hours a day, and I use it now and again for that reason. But the moment your instances are up more than about 8 hours a day, and you have more than a few of them, it will quickly start costing you more than the alternatives.
I think Adrian Cockcroft & Jedberg may disagree with this statement.
Netflix has made a point (and a business model) of pushing all their infrastructure costs for their streaming service to AWS for many reasons.
They clearly have a HUGE amount of traffic across their service, and they are very successful in keeping a lean team on staff with a focused skillset while not needing all the IT ops folks on staff. The HW costs to support their service would be very large, as would distributing that HW across the [nation|globe] to support their userbase.
Also, I do not think you're properly accounting for all the design and support considerations.
In a large infrastructure implementation you're going to need quite a few ops specialties (in smaller orgs these roles can be collapsed; in very large orgs they are discrete, and your ops costs get high fast in large infrastructure deployments):
Support (deployment, ops, maintenance, etc.)
With the need for 24/7/365 ops coverage - especially if you have multiple regions/internationally deployed infrastructure... you can see how this can get expensive.
So, I think there are a few sweet spots that can be looked at.
Finally, there is also the hybrid model, where you have your own base-line infrastructure which scales out to AWS to support larger load (CDN model)
They might. But either they haven't priced it out, or they have decided it's worth paying several times as much for some reason. Given how often the high price of EC2 gets brought up, and how I've never seen them actually address the pricing issue, I'm not going to speculate why they've decided to make that tradeoff. I find it quite baffling, though, and I'd be very interested if they have done a serious assessment of it somewhere.
> They clearly have a HUGE amount of traffic across their service, and they are very successful in keeping a lean team on staff that has a focused skillset while not needing all the IT ops folks on staff.
Given the very public, very extensive issues that Reddit in particular has had with their hosting, and how they kept taking the entire service down for maintenance seemingly always when I wanted to use it (since I tend to use it when Americans are sleeping, I guess), I'm not so sure this is a glowing endorsement of doing things their way. I certainly couldn't get away with the stability record Reddit has - the CEO where I currently work would look at me as if I were crazy if I suggested even the number of scheduled maintenance windows Reddit takes. I don't use Netflix, so I haven't kept track of how they're doing stability-wise.
EDIT2: Actually, looking at their numbers and comparing EC2 prices, I'm fairly comfortable in saying that the setup we're running is larger than theirs in terms of total computing resources (but nowhere near them on bandwidth use), which is quite interesting...
> while not needing all the IT ops folks on staff.
You can have someone else do the IT ops for co-located services too. There are literally thousands of companies offering suitable services on an hourly basis, and dozens that offer it globally. Outsourcing ops is easy.
And with managed hosting, the ops you need to do yourself if you don't pay for extra service tiers is pretty much the same as for EC2. Someone else handles the hardware, just as with EC2. Someone else handles the network, just as with EC2. What you need to handle is what is installed on your servers, just as with EC2.
> The HW costs to support their service would be very large as well as the distribution of that HW across the [nation|globe] to support their userbase.
You pay for the HW with EC2 too. You just don't get to own it at the end. A typical colocated setup often involves leasing rather than purchasing, so you're still typically dealing with monthly payments. And if you don't want to own, managed hosting is still vastly cheaper.
As an example, the leasing cost for our latest purchase of a quad-server box - 4x dual hex-core 2.6GHz CPUs with 24GB RAM each, and 24x 256GB OCZ Vertex 4 SSDs - is about $600/month per unit. With its share of our rack space, power, bandwidth etc., the full hosting cost excluding our ops cost for this box is about $750/month (this accounts for the fact that our racks are currently nowhere near full, so the price is higher than it could be).
Comparing them to EC2 is a bit tricky, since there's no direct equivalent. But to be very generous to EC2, and using a model that these servers substantially outperform, consider that 4 x M3 Double Extra Large in US East is around $3300/month (which is indeed quite a bit better than last time I looked - I'll grant that), leaving me about $2550/month to assign to ops for that single box.
In reality, for our loads the more direct equivalent would likely be the High I/O EC2 instances, which are almost 3 times as expensive.
(EDIT: Note also that this is before accounting for any bandwidth charges or costs for EBS volumes and the like on EC2; on the other hand you can of course cut the hourly cost by paying upfront for reserved instances - effectively you're then paying for "fractional managed hosting"... Last time I looked that still ended up more expensive, though the margin is definitely better.)
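For what it's worth, the core arithmetic of the comparison above is just this (numbers are the rough 2012-era figures quoted in this comment, excluding bandwidth, EBS and similar extras):

```python
# Rough monthly cost comparison using the figures quoted above.

colo_box = 750        # full hosting cost for one quad-server box, $/month
ec2_equiv = 4 * 825   # ~4x M3 Double Extra Large in US East, ~$3300/month

ops_budget = ec2_equiv - colo_box
print(ops_budget)  # 2550 -- $/month left for ops before EC2 breaks even
```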
If we had hardware that required enough extra time to deal with to cost us anywhere near that, we'd throw it in the garbage. We're in London. Here, that's 30%-50% of the fully loaded cost of a mid-level ops person...
In reality our dev-ops cost per server (remember the box above is four individual servers) is ~$400/month and dropping, as part of that cost is development work to automate more of our maintenance. That is our total. Of that, ~$100/month relates to the physical server and network infrastructure and its maintenance - costs that are included in the EC2 price.
The rest relates to maintenance and monitoring of the VMs running on those servers, which we'd still pay for if we were using EC2.
So comparing against the relatively underpowered EC2 instances above, one of our new boxes costs us ~$1150/month for equivalent service, or ~$2350/month total. So we're getting all the dev-ops and monitoring for our VMs "for free" and then some compared to EC2, despite being small enough that we have a lot of ops overhead.
Judging from our growth, our dev-ops cost per server with twice as many servers as we have today would likely only increase by ~10%-20%, and so our per-server cost would drop accordingly. Similarly, our rack and power costs would remain roughly constant, as we have spare space in our racks, so the per-server costs would drop even more. I'd expect our rough per-box cost for the quad-server boxes above to drop to ~$900/month if the number doubled, with "EC2-equivalent" ops included.
Keep in mind again, that this is comparing to an instance type I know these servers outperform comfortably, and excludes EC2 bandwidth and EBS or other services.
> In a large infrastructure implementation you're going to need quite a few ops specialties:
I don't know why you believe that EC2 is any simpler to work with than managed hosting in this respect. It isn't. Simpler than a co-located setup where you own your own servers, sure. You don't need much size before it's still cheaper, though.
Many hosting providers even provide APIs for their managed hosting, and deploy it all using Xen, with the only difference being that you commit to paying for full months of service and a dedicated physical machine. At the same time you often get the benefit of being able to order custom setups tailored to your workload.
> Finally, there is also the hybrid model, where you have your own base-line infrastructure which scales out to AWS to support larger load (CDN model)
I mentioned exactly that, and it is what I recommend unless there are other reasons not to use EC2. If you handle peaks via EC2 and your traffic is suitably spiky, you can carefully load your dedicated base servers to 90%+ instead of the often <50% you'd see without any way of rapidly scaling up, which drives the cost advantage of dedicated hardware for your base load even higher.
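A quick back-of-the-envelope illustration of that utilization argument (the 90% and 50% figures are the ones used above; everything else is made up):

```python
# How much dedicated capacity you need for the same base load, with and
# without EC2 available to absorb peaks. Units are arbitrary.

base_load = 100.0

# Without rapid scaling you keep heavy headroom, so steady-state
# utilization often sits below ~50%.
servers_without_burst = base_load / 0.50

# With EC2 handling the spikes, dedicated servers can run at 90%+.
servers_with_burst = base_load / 0.90

print(servers_without_burst / servers_with_burst)  # ~1.8x more hardware
```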
Or just simply publish your outgoing EC2 ip pool list.
We cannot completely block EC2 because of Pinterest, and that's a bad situation.
I've got my own DB of hosting facilities, which I built by taking 100M URLs, doing a lookup on each hostname, and saving the resulting IP. This gives you some level of confidence that a certain class 'C' is used for hosting.
Google is easy to identify this way, even with a spoofed user agent (which they do a lot now).
But this technique is not possible with EC2 because Amazon refuses to make a public database of what customer is using what.
That's part of their page-cloaking detection code.
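A minimal sketch of that kind of classifier, assuming (hostname, IP) pairs harvested from a crawl - the function names and the threshold here are mine, not the commenter's:

```python
import ipaddress
from collections import Counter

def class_c(ip):
    """Collapse an IPv4 address to its /24 ('class C') prefix."""
    return str(ipaddress.ip_network(ip + "/24", strict=False))

def likely_hosting_blocks(hostname_ips, min_sites=3):
    """Flag /24 blocks that serve many distinct hostnames -- a crude
    signal that the block belongs to a hosting facility rather than a
    single end user."""
    sites_per_block = Counter()
    seen = set()
    for host, ip in hostname_ips:
        net = class_c(ip)
        if (host, net) not in seen:
            seen.add((host, net))
            sites_per_block[net] += 1
    return {net for net, n in sites_per_block.items() if n >= min_sites}
```

With, say, three distinct sites resolving into 192.0.2.0/24, that block gets flagged as hosting, while a one-off IP elsewhere does not.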
How do your watchdog instances monitor other hosts? Do they track usage/load via snmpd, or something else?
Also, are they directly calling the EC2 API to launch or shut down hosts, or do you have an in-house deployment system?
If I was doing it over again, I'd just use Amazon's auto-scaling features for all of this. At the time we built this, EC2's auto-scaling didn't support some of the features we needed. Since then, they've made it a lot easier to do things like set up a repeating schedule for auto-scaling, rather than using metrics.
We only have one EC2 AMI that we use for all of our servers. That AMI is pretty basic; it only does enough to connect to our Puppet configuration management servers. Puppet then configures the boxes as web servers (or databases, or...) and adds them to the appropriate load balancer.
The watchdog isn't very sophisticated -- it just checks to make sure that the correct number of instances are running in each auto-scale group.
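As described, the watchdog reduces to a small reconciliation loop. A sketch - the group names and counts are invented, and the actual launch/terminate calls (boto, in that era) are left out:

```python
# Desired instance counts per auto-scale group (hypothetical numbers).
DESIRED = {"web": 80, "api": 20}

def reconcile(desired, running):
    """How many instances to launch and terminate for one group."""
    return max(desired - running, 0), max(running - desired, 0)

def watchdog_pass(running_counts):
    """One pass: compare each group against its target and report the
    corrective actions; a real watchdog would then call the EC2 API."""
    actions = {}
    for group, desired in DESIRED.items():
        launch, term = reconcile(desired, running_counts.get(group, 0))
        if launch or term:
            actions[group] = {"launch": launch, "terminate": term}
    return actions
```

E.g. `watchdog_pass({"web": 78, "api": 22})` would report launching 2 web servers and terminating 2 api servers.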
How did you decide on the right number of instances? The article mentions 20%--is that based on latency, a cost-saving target, or something else?
Also, how do you use regions/availability zones?
When we need more servers for an auto-scaled service, we open spot requests and start on-demand instances at the same time. For most services, we want to run about 50% on-demand and 50% spot. We have a watchdog process that continually checks what's running. It launches more instances whenever there aren't enough, and terminates instances when there are too many. So if the spot price spikes and a bunch of our spot instances are shut down, the watchdog will launch replacement instances on-demand. It will also request more spot instances once the price has dropped back to normal. In reality we don't often run into spot capacity issues -- maybe once a month, and it's almost never apparent to our users.
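The 50/50 replacement policy described above could be sketched as a pure decision function - all names and the exact fallback behavior here are my reading of the comment, not Pinterest's actual code:

```python
def plan_replacements(desired, on_demand, spot, spot_market_ok=True):
    """Decide how many on-demand and spot instances to request so the
    group reaches `desired` with roughly a 50/50 mix; when the spot
    price is spiking, cover the whole shortfall with on-demand."""
    shortfall = desired - (on_demand + spot)
    if shortfall <= 0:
        return 0, 0  # at or over capacity; termination is handled separately
    want_spot = 0
    if spot_market_ok:
        want_spot = min(max(desired // 2 - spot, 0), shortfall)
    return shortfall - want_spot, want_spot
```

E.g. if a spot-price spike kills 3 of 5 spot instances in a 10-instance group, the watchdog requests 3 on-demand replacements immediately; once the price is back to normal, the same logic fills any new shortfall with spot first.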
I spoke about this in detail at AWS re:Invent last month, and the full talk is available online here: http://www.youtube.com/watch?v=73-G2zQ9sHU
Number of Servers?
How much bandwidth per month?
How much space needed and at what rate is it growing?
Traffic stats (if possible)?
That said, this adds complexity to their systems with the only benefit being cost savings. Given that we can assume that no code is perfect, it's likely that at some point the auto-downscaling will cause an outage or period of slow responses, which could easily lead to lost usage and trust that costs them as much as they're saving on ops.
In other words, it's just good design.
Despite the fact that, in theory, a mainframe should never go down, most dinosaur pens will power cycle them regularly just to see what happens if you come up from a cold start.
A mate of mine worked in a dino pen where they did this on Saturday evenings. He told amusing stories.
That's not the only benefit. They also get much better reliability. The systems scale down with load, but they also scale up. As their load increases their system scales up along with it, giving greater reliability during increased load. Since they have to architect for that, it makes scaling up in general easier.
It took them 2 weeks to implement. That means it's saved a lot more than it took to implement (2 weeks times a couple of engineers) already.
To use your metric, if they saved close to $300K/year, that means they could afford to add 2 engineers to their staff, which is very significant.
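Spelling that out (the ~$150K fully loaded cost per engineer is my assumption, not a figure from the thread):

```python
annual_savings = 300_000          # rough savings figure from the thread, $/yr
engineer_fully_loaded = 150_000   # assumed fully loaded cost, $/yr

print(annual_savings // engineer_fully_loaded)  # 2 extra engineers
```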
I heard this at AWS re:invent, and thought I must have misunderstood. I'm still confused as to how this can possibly be a good strategy.
The pool of EC2 on-demand instances has a finite size -- it isn't magical -- and does hit its capacity limit from time to time. When there's high demand for EC2 instances -- say, when there's an outage in another AZ -- you're likely to see both spot prices going up and a lack of capacity in the on-demand pool. As a result, this strategy seems designed to only ask for on-demand instances at the times when they're least likely to be available.
You actually see high spot prices due to what at first glance seems "irrational": incredibly high bids on instance types, sometimes 3-4x over on-demand prices. I suspect such high bids are placed by spot customers who absolutely do not want their workload terminated early by Amazon and are willing to risk paying more to run it to completion.
If you are a webapp though, like Pinterest, you don't have this desire. Hence, it makes sense to dynamically switch.
See also http://blogs.platts.com/2012/11/20/electric_prices/
Edit: The blog post is more convincing, but again, are those numbers really comparable? Maybe they are; I don't know enough about the subject. I just find it a little too simple to compare numbers from different websites without deep knowledge of the topic.
Here's another link, the prices differ (?):
And vice versa: most electrical utilities have slow-starting, efficient-as-possible turbines that never get turned off (baseload -- coal is most common), and a bunch of relatively inefficient but flexible turbines (usually natural gas).
Actually, natural gas plants are at least as efficient as coal. They're just more expensive, especially if you turn them on and off a lot, which is pretty bad for the lifetime of a lot of components.
Wind and solar are similar in the sense that you do not gain by turning them off, but their supply is not stable.
If a sudden load drop forces nuclear plants to shut down for safety reasons, availability can be affected for weeks afterwards.
I'd love to move to a provider that let me provision an extra instance or two for either failover or testing/staging but not be charged for it if I wasn't running traffic to it.
EDIT: I stand corrected, I might have been thinking of Rackspace's cloud (can't remember what it's called now) instead of AWS. But I know for a fact I am right on GoGrid (and pretty sure Azure) because I have a long email chain arguing about charges for provisioned instances in off states.
That being said, if I built custom app servers, I'd use SSDs because the cost is small for a system that doesn't need much storage.