A factor this post doesn't mention is bandwidth cost. If you use a lot of bandwidth and negotiate competitive hardware pricing, you also save with dedicated hosts.
Say you have 10 machines at SoftLayer and use 30TB a month. Each machine comes with 3TB and you pool your bandwidth for $25 per server so you can allocate the whole 30TB to your proxies. It's unknown what fraction of your server cost is applied to bandwidth, but we know the point where you start saving.
At Amazon, 30TB of US-East EC2 bandwidth costs 10,000 GB × $0.12 + 20,000 GB × $0.09 = $3,000.
If you estimate the bandwidth portion of your SoftLayer server cost at less than $275 you're saving money when using your full bandwidth allocation. With servers starting at $159, sub-$100 seems realistic.
In our case with dozens of servers and ~60 TB of bandwidth, we're saving thousands a month compared to EC2.
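To make the comparison concrete, here's a small sketch of that tiered egress math. The rates and tier boundary are the 2011-era US-East numbers from the comment above, simplified (no free first GB, and everything past 10TB billed at the second-tier rate).

```python
# Sketch of tiered EC2 egress pricing as described above. Rates and tier
# sizes are the 2011-era US-East numbers from the comment, simplified:
# no free first GB, one tier boundary at 10TB.
def ec2_egress_cost(gb):
    """Monthly egress cost in dollars for `gb` gigabytes transferred out."""
    tiers = [(10_000, 0.12), (float("inf"), 0.09)]  # (tier size in GB, $/GB)
    cost, remaining = 0.0, gb
    for size, rate in tiers:
        used = min(remaining, size)
        cost += used * rate
        remaining -= used
        if remaining <= 0:
            break
    return cost

print(ec2_egress_cost(30_000))  # about $3,000/month, matching the figure above
```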
Yeah bandwidth can be insanely expensive on the cloud. Amazon at least can mitigate that if you're exclusively on their network so it's all incoming + across-their-network but as soon as you start kicking terabytes off their network you're going to feel it.
Dedicateds with massive bandwidth plans are very easy to come by, on top of the perks of dedicated io and raid (which also starts very cheap).
Yes, that is also an issue for us, we provide a large number of downloads (virtual machines, installers) that go into the TB and it would be quite expensive to move that over to EC2. One nice feature of Amazon, if you can afford it, is the ease of use with which you can put that content on their CDN
Be sure to read the documentation though. At the very least, keep in mind that they do not honor query strings, so for any dynamic (i.e. script-generated) content you want cached, you may have to do a little webserver rewrite magic.
What's the biggest technology mistake you've ever made, either at work or in your own life?

Prior to Facebook, I was the chief executive of a small internet startup called FriendFeed. When we started that company, we were faced with deciding whether to purchase our own servers, or use one of the many cloud hosting providers out there like Amazon Web Services.

At the time we chose to purchase our own servers. I think that was a big mistake in retrospect. The reason for that is despite the fact it cost much less in terms of dollars spent to purchase our own, it meant we had to maintain them ourselves, and there were times where I'd have to wake up in the middle of the night and drive down to a data centre to fix a problem.
I've always thought the "Drive to the datacentre" argument was BS. If you're writing your app for the cloud, you have to deal with spurious instances going away, degrading, etc. It is no different in the datacenter. If you're driving to the datacenter in the middle of the night to replace a disk or a fan, you're doing it just as wrong as if getting evicted from an EC2 instance causes you to have to scramble oncall resources.
In my experience, the highest operational cost with running services is managing the application itself - deployment, scaling, and troubleshooting. None of that goes away with the cloud.
I have to agree. I put our stuff in a colo 2 years ago and never looked back. Pretty much all servers come with some kind of remote console interface (IPMI), and that's not just terminal redirection: it's a totally self-contained microprocessor and Ethernet port that you can run on a separate subnet to control your server even when it's off. I've updated the BIOS and reinstalled OSes, all via IPMI, which is part of the motherboard. Add to that power strips that you can also control remotely and you're all set.
Our servers are in the Bay Area, I'm in Canada. I have NEVER had to drive/fly to fix anything. Never even had to use remote hands for anything. Sure some drives died, but standby drives are in place.
The costs are dirt cheap these days. You can get a full rack, power, and a gigabit feed for about $800 in many colos in Texas. We opted for Equinix in San Jose, which is all fancy with work areas, meeting rooms, etc. when you're there, but the funny part is, we're never there!
I do like the virtualization for some maintenance/flexibility so we have a few servers that are hosts and we run our own private cloud where we get to decide where/what runs. Database servers on bare metal with ssd drives in other cases.
Best of both worlds.
It's so cheap you get a second colo in a different part of the country to house a second copy of your backups, and some redundant systems just in case something really bad happens.
Oh yeah, and don't get me started on storage. We store about 100TB of data. How much is that on S3 per month? $12,000/month! A fancy enterprise storage system pays for itself every couple of months of S3 fees.
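The S3 figure above is consistent with the roughly $0.12/GB-month list price of that era; here's the back-of-envelope check (the flat blended rate is an assumption, and S3's actual tiered pricing is ignored).

```python
# Back-of-envelope S3 storage bill. The ~$0.12/GB-month blended rate is
# an assumption matching the comment's era; real S3 pricing is tiered.
def s3_monthly_cost(tb, dollars_per_gb_month=0.12):
    """Rough monthly storage cost in dollars for `tb` terabytes."""
    return tb * 1000 * dollars_per_gb_month

print(s3_monthly_cost(100))  # roughly $12,000/month for 100TB
```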
I have NEVER had to drive/fly to fix anything. Never even had to use remote hands for anything. Sure some drives died, but standby drives are in place.
Consider yourself lucky. We thought the same thing, but when a RAID controller died on us recently we really didn't know what hit us. It didn't just stop working, it started by hanging the server every now and then, then after a day slowly corrupting drives, then after a day or two it stopped completely.
I'm a bit conservative when it comes to hardware like RAID controllers. My choice was 3ware. They are by no means the fastest; in fact the performance sucks compared to others.
I went to a company that builds storage systems (they will build any kind you want, not locked into any one controller), and I trusted them when they recommended the brand that, in their experience, is returned and fails the least.
Of course everything fails, so it's just a matter of time.
We have triple-redundant storage for file backup: active, a 5-minute-behind copy that is ready to be swapped in at one click, and long term. If something goes wrong with the active set or it slows down, we just flip a switch and all our app servers use the new system, which is at most 5 minutes behind. The old system gets shot in the head and can be diagnosed offline. Shoot first, ask questions later.
You are not necessarily doing it wrong; you may simply not have enough resources ($$$) to buy enough hardware for complete redundancy.
When you get evicted from an EC2 instance, you just switch to a new one; the cost is constant. When your piece of hardware at the datacenter goes down, unless you had the resources for a spare one, you are hosed.
The cloud became the new hot thing and lots and lots of people, sites, and enterprises jumped on board. Need to quickly deploy code without any understanding of how infrastructure works? Great, the cloud solves your problem! Who needs experienced infrastructure people anyway? Need to quickly respond to a spike in traffic? Great, the cloud solves your problem! Need guaranteed SLAs and reliable CPU, memory, and disk performance? Yeah, good luck with that. The price of the cloud, at any sort of reasonable scale, is almost always higher than doing it yourself (VMs). Cloud computing isn't completely bad, but it's not the panacea that too many people make it out to be.
> One last thing about getting dedicated hardware. It's cheaper, a lot cheaper. We have machines that give us 2-4x performance that cost less than half as much as their cloud equivalents, and we're not even co-locating.
I've been saying this for ages, and every time people would fall over backwards trying to defend/prove their cloud mistake...
"The cloud is cheaper, faster, and infinitely scalable."
Except that none of those three is true for real-world use cases, save a few.
The moment a popular site like Reddit switches to the cloud, is the moment it becomes barely usable during certain times of the day.
So I'm running operations at Blekko, and just prior to that I was at Google in their eng/ops organization. I had been doing a whole lot of 'total cost of ownership' aka TCO computations around engineering infrastructure both for Google and of course now for Blekko.
The conclusion I came to is that for a 'web 2.0' type setup, the break-even point was about 500 'machines.' That was in part because a 'machine' today has 8 - 24 'threads', 2 - 40T of 'storage' and (at the time) 2 - 96G of 'memory.' So in terms of 'cloud' you could easily run 10 "instances" on these sorts of machines; 500 machines might be 5000 'instances' in an AWS-type cloud.
It's this '10:1' multiplier effect (which is only getting better with bigger machines), plus the management techniques of running the same config everywhere, etc., that means your TCO goes up more slowly than the capacity of the resulting infrastructure, so you can 'solve for x' where the two lines cross to identify the break-even point. Everything east of that point, you're coming out ahead of a 'cloud' based deployment.
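The 'solve for x' step can be sketched numerically. All the dollar figures below are hypothetical; only the 10-instances-per-machine ratio comes from the comment above.

```python
# Toy break-even solver for the TCO argument above. Self-hosting carries
# a large fixed monthly cost (staff, cabinets) plus a low per-machine
# cost; cloud cost scales linearly with instances. Dollar figures are
# invented for illustration.
def breakeven_machines(cloud_per_instance, selfhost_fixed,
                       selfhost_per_machine, instances_per_machine=10):
    """Smallest machine count at which self-hosting is cheaper per month."""
    m = 1
    while (selfhost_fixed + m * selfhost_per_machine
           >= cloud_per_instance * m * instances_per_machine):
        m += 1
    return m

# $200/instance-month on cloud vs. $600k/month fixed + $800/machine:
print(breakeven_machines(200, 600_000, 800))  # crosses over around 500 machines
```

With these made-up inputs the lines cross near the ~500-machine figure the comment arrived at, which is the shape of the argument rather than a real quote.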
What is still a challenge however is geographic diversity. If you wanted to put 500 machines 'around the world' so 125 machines in each 90 degrees (approximately) of longitude, the economics of getting 5 - 10 'cabinets' in places around the world can work against you. (you have more negotiating power if you're putting in 100 racks than if you are putting in 10 racks)
Beware the "same config everywhere" approach. It works up to a point, then it turns into a disaster. All you need is one totally broken change like "chmod -x /usr" to really make life interesting. You start bleeding machines and pretty soon you have nowhere left to host your tasks.
It's interesting, right? At first, you can handle a couple of totally mixed-up machines. Then it stops scaling and you have to start doing the whole "golden + syncer" approach.
Then you go too far and get into a monoculture. When the machines do break, it's impossible for humans to go around and fix them in any reasonable amount of time because there are too many. It's amusing when this happens and the solution put forth is "more administrative controls".
Yes, I have a white paper, 'Uniformity Considered Harmful', which addresses some of those issues 'in the large'. The interesting bits are twofold: one is where you draw the lines around the 'meta-computer', the other is how you balance flexibility in provisioning against utilization in production. I'll see if I can get a blog posting up on this stuff.
You just need to do rolling deployments to make simple things like that go away.
Roll out a deployment to N machines (say 10), run self checks (you have those, right?), and if everything passes give them standard load. Over the next K period of time, periodically check up on them. After that, roll out to N*2 or N^2 nodes, and continue until you have rolled out to your entire cluster.
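A minimal sketch of that exponential rollout; `deploy` and `self_check` are placeholder hooks you would wire into your own tooling.

```python
# Exponential rolling deployment as described above: deploy to N hosts,
# run self checks, soak for a while, then double the batch size.
import time

def rolling_deploy(hosts, deploy, self_check, batch=10, soak_seconds=0):
    """Deploy in growing batches, halting the rollout on any failed check."""
    i = 0
    while i < len(hosts):
        wave = hosts[i:i + batch]
        for host in wave:
            deploy(host)
        if not all(self_check(host) for host in wave):
            raise RuntimeError("rollout halted after %d hosts" % (i + len(wave)))
        time.sleep(soak_seconds)  # the "K period of time" between waves
        i += len(wave)
        batch *= 2  # N -> N*2, per the comment
    return len(hosts)
```

The point of the soak and the growing batch is that a totally broken change like "chmod -x /usr" only ever reaches the first small wave.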
This may be true if the available virtual servers fit your configuration needs. If they don't individually provide enough IO bandwidth, storage space, RAM, CPU time, etc., and you have to add a second virtual server to make up for the lacking attribute, the ratio could change from 10:1 to 20:1, making dedicated servers an even better option.
The more custom your configuration the better bare metal is.
I think the OP just meant that once someone like reddit moves into your cloud, your performance goes down. Meaning, the hardware that my little blog is on works great until that hardware is also hosting reddit, and then my blog takes a beating because reddit is thrashing away with actual activity.
Not if the cloud you're running your workload on is large enough. The Amazon Web Services cloud could absorb many Reddit sized sites without any other customers experiencing an impact. AWS is _enormous_.
Not in the sense of running out of machines, but at least from my past experiences if your VMs share physical hardware with someone who is using their maximum IO, yours will suffer as well.
It makes it more challenging to load test. When there is no contention you can usually 'burst' to use more of the machine's resources, but you can't necessarily trust that you will always have that capacity.
"You would think that you could get better prices by signing 1 or 2 year contracts, but interestingly enough, out of the initial 5 providers we talked to the two that didn’t require contracts had the best prices."
"We’ve moved 100% of our machines that rely upon performant (sic) disks to dedicated servers hosted at Softlayer. Roughly speaking, this corresponds to about 80% of our hosting costs. Eventually, we’ll move everything "
If you don't have a contract there is nothing to prevent a provider from raising prices on you. The reason to have a contract is not just to get the best price; it's to have a price guarantee. Edit: moving 100% of your machines again won't be something you will want to do. If you have a contract you can renegotiate well in advance of any price increase.
Prices always drop?
People thought housing prices always go up as well.
How long is the price guaranteed for? I'm assuming Mixpanel has this issue covered, but it's important to keep it in mind. Not having a contract goes both ways.
There are other aspects of a contract that can be beneficial:
SLA, including compensation for outages, and outs if they have too many. Sure, without a contract you can leave anytime, but a contract isn't necessarily a permanent trap. They are negotiable: you can push for lower incident allowances, an opt-out partway through the contract, and so on.
Support, more in the sense of actual human assistance with the move and with issues. Depending on your size and the terms, that contract could be worth 6 to 7 figures to the company. That is some serious motivation to make the initial experience good and to help along the way.
Generally "no contract" means no fixed-term (eg: multi-year) contract. There is always a contract of some sort covering acceptable use, SLA, payment terms, termination rights, and so on. Otherwise you could use as much as you want for any purpose and not pay them a cent and they'd have no recourse. You can even have a contract locking in your price for five years but that you only pay month to month and can leave at any time without penalty.
Leasing physical servers could be regarded as "cloud", but usually wouldn't be, because that method of hosting tends not to meet the resource pooling, rapid elasticity, and measured service (esp. "automatically control and optimize resource" use) characteristics. (Of course, one can argue that it can do those things, but generally it doesn't.)
"Cloud" isn't* an engineering term, and thinking about it in absolutist, engineering terms makes as much sense as thinking about "web 2.0" in engineering terms 5 years ago.
Infrastructure as a Service (IaaS) cloud generally implies a few distinguishing features, such as hourly utility billing with no long-term commitments, rapid elasticity (provision/de-provision in minutes, not hours), and usually (but not always; there are bare metal clouds) virtualization. People often refer to Software as a Service (SaaS) as cloud too, which I think is where it starts getting very ambiguous (i.e. is any hosted web application in the cloud?).
Same. Anything that isn't actually solely nailed down to your own hard drive is now 'cloud'. I think we need to find terms that marketers don't get excited by, and thus won't 'embrace and extend' to fit their needs.
> Cloud implies running in virtualized environment.
Unfortunately, 'cloud' implies nothing. Apple's iCloud isn't necessarily storing my data using VMs and EBS/BigTable type FS. They could be using dedicated W2K/IIS 4.0 boxes storing BLOBs in MSSQL and still be considered cloud by everyone.
Also, you don't necessarily take a huge hit in I/O with virtualization. VMware on dedicated hardware with all the virtualization extensions will be pretty comparable, and a lot easier to maintain than straight-up raw iron.
To be fair, this is only true if the storage back end is appropriately configured.
You will take an I/O hit when instead of a single physical machine asking for a set of sequential blocks off the disks, you have 20 virtual machines asking for seemingly random blocks off the disks.
Replace disks with storage array if you'd like, but the fact remains: more VMs will mean more storage contention. If you have the funds for dedicated arrays per VM, hats off to you. Most people never do this, and I/O suffers a penalty. Virtualization has its price, and even that being said, I think it's worth it for most people.
Your point about multiple VMs causing additional seeks is correct, but it misses the bigger picture.
Basically, on any reasonably sophisticated hosting infrastructure those aren't a problem.
If your problem is just the sequential-to-random conversion due to additional VMs, then bcache/flashcache does a surprisingly good job of making that just plain go away at little additional cost.
On any reasonably sophisticated host you have a DSAN, which has a cool little stats trick: as you add additional VMs together, the variance on the total IO load drops, and the load pattern itself becomes more and more normal the larger and more uncorrelated you get.
That gives each VM more 'burst' capacity when required, with many fewer failures (i.e. the VM asking for more IO than the current capacity of the system).
This leads to a bunch of interesting stuff when you try to apply it in real world systems, either in HPC or in clouds.
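The statistics claim above is easy to check numerically: the relative spread of the summed IO load shrinks as you aggregate more uncorrelated VMs. The workload numbers below are invented for illustration.

```python
# Numerical illustration of the aggregation argument above: summing many
# uncorrelated VM loads yields a total with much lower relative spread,
# which is what leaves headroom for individual VMs to burst.
import random
import statistics

def relative_spread(n_vms, samples=2000, mean=100.0, sd=50.0):
    """Stddev/mean of total IO load across `n_vms` uncorrelated VMs."""
    totals = [sum(max(0.0, random.gauss(mean, sd)) for _ in range(n_vms))
              for _ in range(samples)]
    return statistics.stdev(totals) / statistics.mean(totals)

random.seed(42)
print(relative_spread(1))    # roughly 0.45: a single VM is very bursty
print(relative_spread(100))  # roughly 0.05: the aggregate is far smoother
```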
I work over at orion, and our standard cloud VMs dominate dedicated hardware in IO. There really isn't much competition.
You find most of the 'virtualisation is bad for I/O' claims are really either 'oversold I/O is bad for I/O' or 'trying to push all the IO for every VM on a box, plus all the WAN traffic, through one GigE port is bad for I/O'.
Agreed, but wouldn't you say that sole use of physical hardware (whether or not you own or lease that hardware, and regardless of physical location) is really the issue that separates cloud from non-cloud?
I consider the MT servers cloud since the hardware is shared. And what we are paying for is simply memory, transfer and disk space. I don't even know what hardware we are running on there and I don't know anyone else that is running on that hardware either.
But I would also consider the servers in the office and the colocated servers non-cloud even if they were leased (which they are not).
I think currently the biggest problem with the cloud is the inability for cloud providers to truly estimate and understand their risk, which also means that customers don't have the ability to understand their risks either.
For example, Amazon could estimate 99.95% uptime, because of physical and geographical redundancy, etc. But this analysis would be faulty, as their outage earlier this year showed.
There are a litany of long-tail black swan events that could bring down entire datacenters that people just can't anticipate. Not even counting earthquakes, terrorist attacks, etc., there are simple upgrades or misconfigurations like the one that took down their East Coast datacenter. Yet they still advertise an SLA of 99.95% availability. Is the risk of downtime really only 0.05%? Was the event that occurred really a 3-standard-deviation event? I highly doubt it.
This complete lack of any true ability to estimate risk means that customers also have an essentially inaccurate view of what their risks are. Take the commenter who mentioned a small business running their POS device over the cloud: if you told the owner they would be down 2 days out of the year, would they really be interested? Probably not.
In a similar vein, the authors were likely promised great uptime, but no guarantees on I/O or CPU performance, which is something you don't think of. The cloud provider doesn't have to be down for your web service to be drastically affected. I suppose since this is all new, the customers are learning which questions to ask, and the cloud providers are learning which things to guarantee, so hopefully this is worked out in the next year or so.
While I agree with you providers obviously have been unable to avoid "long-tail black swan events" (awesome phrase!), and that too many businesses and users jump to the cloud without actually understanding their architecture, availability does not mean what you are implying.
99.95% availability means your site should be "available" 99.95% of the time. It does not mean you have a 0.05% chance of a disaster; it means you will not have more than about 21.6 minutes per month of outages. Those 21.6 minutes might be during your most critical time. They might even all be added together into one 4-hour downtime right before you're demoing to VCs and still not violate your SLA for the year.
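A quick sanity check on that arithmetic, showing what an availability SLA actually permits per month and per year:

```python
# What an availability percentage allows in outage minutes. The 99.95%
# figure is the SLA discussed above; the rest is plain arithmetic.
def allowed_downtime_minutes(availability, days):
    return (1 - availability) * days * 24 * 60

print(allowed_downtime_minutes(0.9995, 30))   # ~21.6 minutes per month
print(allowed_downtime_minutes(0.9995, 365))  # ~263 minutes (~4.4 hours) per year
```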
One, datacenter failure happens with any hosted application that runs within a single datacenter. Most other types of hosting also have a litany of other failure modes that go away with cloud, to be replaced by other modes. For example, you no longer have to deal with failures of individual hard drives, but now you have to deal with failures of EBS clusters.
Two, cloud provides a standard system that makes HA easy. Rather than doing dual-DC failover in hardware and dealing with BGP, anycast IPs, DNS failover, split-brains, hardware STONITH, etc., all of this is taken care of for you and/or made a lot easier.
Thanks for sharing the experience. In your particular case it seems the right decision to move to a dedicated provider; the reasoning was similar to GitHub's (https://github.com/blog/493-github-is-moving-to-rackspace). Cloud environments make the most sense at either end of the spectrum: if you are Netflix, where the choice is not to purchase a few servers but to build your own cloud infrastructure (http://perfcap.blogspot.com/2011/08/i-come-to-use-clouds-not...), or if you are just getting started, when you are cash-constrained, there are a lot of uncertainties, and the cloud lets you run quick experiments. Take our case: we provide a cloud hosting tool (http://bitnami.org/cloud) but we don't run all of our systems there; they are divided between "traditional" providers (one of them Softlayer) and AWS. It is a bit of a pain, but for our current requirements and budget, it works nicely.
Rackspace's primary business is dedicated hosting. In fact they let you have both dedicated hosts and cloud hosts and allow dedicated and cloud hosts to talk to each other.
I'm curious why they didn't just switch to Rackspace's dedicated hosting. It would have given them the performance they needed while retaining the flexibility of being able to quickly spin up cloud machines in the same datacenter as the dedicated machines.
You're wrong about that. The big hosting providers do not put public cloud hardware next to dedicated hardware. Public cloud hardware goes in big, homogeneous racks in a specific area of the datacenter. Dedicated hardware is very mix-and-match; you have all sorts of servers and network devices racked together. Connecting the two environments is no small feat, given the VLAN limitations of most current switches.
Being in the same datacenter is a feature since it means there is extremely low latency between your dedicated and cloud machines. Sub-millisecond vs. 10s of milliseconds round trip times can make a huge difference.
Rackspace is very expensive, almost 2x their competitors. At work we left Rackspace recently because we needed to upgrade our server and the price was going to jump to $1,200 a month! Plus we found out that we were running on two-year-old discontinued hardware. We did still have an SLA, but it didn't apply to hardware failures, which would be fixed on a 'best effort' basis and could mean a days-long outage. We moved (to the cloud, ironically) and got a huge performance increase for half the price (and that includes paying a company to monitor/manage it for us).
If I had to guess, I would guess the cost of a Rackspace managed dedicated server is too high. I've been working with Rackspace's managed dedicated hosting for 10 years. They are pricey, but for a small shop where you don't have a dedicated network/hardware guy, it's worth every penny. 99% of the time I've called their Fanatical Support, I heard a human voice after one ring.
If you are running VMs on top of your dedicated servers, providers now all offer a dedicated VLAN for your servers. This allows you to deploy your own VM management software on top of your dedicated servers.
If you can, I recommend you to use Ganeti with Xen or KVM (I use KVM). Rigorous development, very friendly developers and very well designed tools. No wonder it is used internally at Google.
Yes, but if it were a dead horse, they would not have kept it in production and improved it over the past 5 years. I think this is the key: well designed, well maintained, well used for critical stuff in a big company, and all that over several years.
Except that in many places in Australia renting is actually cheaper than buying, because poorly-educated investors have been sucked into the dream of financial freedom through owning an asset that always increases in value.
In short, "cloud" services (aka Software as a Service in the cloud) are brilliantly useful. No one will seriously question the usefulness of Dropbox, for example.
Platform-as-a-Service is also obviously useful. Google App Engine is qualitatively different from running your own hardware.
Infrastructure as a Service is useful too, if you know how to use it, but in a more limited set of circumstances. The circumstances where it's useful include things like: having an event-based site (e.g. a site related to a sport event) which will need a lot of hardware for a short time and not much afterwards; an app where, due to your solid marketing channels, you expect fairly rapid and unpredictable growth of the usage and you will need to provision new servers quickly; an app where the load varies significantly throughout the day or the week (e.g. something that does batch processing of a lot of data on a regular basis, but is idle the rest of the time).
I think the cloud is also good for getting an MVP out the door. The cloud makes sense for a startup that has few employees and wants to focus on the product rather than managing bare metal. Of course, if/when they become successful it makes sense to move onto their own infrastructure.
This is how I always thought it went, so stories of growing companies moving off the cloud don't seem like a big deal, just a natural progression.
Except, of course, when that small business relies too much on the cloud. I know a local gym that decided to "host" their cash register. I told the owner it was a mistake. He basically said I was stuck in the past.
Until his Comcast went down or slowed down.
Or a local non-profit whose board came up with a great way to save money: host their phone system in the cloud. A local carrier was happy to sign them to a 3-year contract. Even supplied the 42 phones. Now they're lucky if they can make calls midday. It's so bad that if 10 phones are in use, the next call will sound like you're calling from a wind tunnel. And forget about calling at peak times. What does the carrier suggest? Upgrading to a T1. Of course that carrier never mentioned this when selling the service in the first place. And personally, with 42 phones plus 50+ computers and other devices, I'm suggesting a T3 (cost down here is about 500 - 600 a month).
My point is, our infrastructure (at least in South Florida) isn't there yet. Sure the cloud is a great idea. But if you can't reach it, it's useless. But that doesn't stop the marketing. Or the complaints.
Most small businesses subscribing to these services aren't technical. In the phone market (as an example), carriers are selling hosted "solutions" for less than $75 a month. Comcast is too. I like Comcast, but you can't run an office with 25 phones and PCs and other devices on Comcast. At least not in Florida.
The other problem is that most IT firms down here are pushing their own "hosted solutions". Everything from email to accounting services. For cheap.
Now let me be clear: some services, I think, make perfect sense in the cloud, even with unreliable connectivity. Like email, storage, messaging. But your core business, the services you must have to run your business, needs to remain under your control. Period.
And the last thing: many small businesses don't really understand just how important IT is to their business.
If it was purely about price, then these people would've purchased these options regardless of the shortcomings, had they known about them beforehand. And if that is the case, they went in with eyes wide open and the cloud is right for them.
I doubt that. Trust me when I tell you this: the carriers' sales people (and I'm including Comcast's) do not tell the customer the downsides and limitations.
If you don't believe me, try it. Call the business sales units of the carriers.
Now, I believe in buyer beware. But that's the problem: most small businesses are suffering cash-flow problems. When something cheaper comes along there is no "buyer beware"; they think about lowering their monthly bills. And of course, in the end they get bitten in the ass.
It's just mind-boggling to me how many business owners are so ignorant about the tech that runs their business.
> It's just mind-boggling to me how many business owners are so ignorant about the tech that runs their business.
In my mind, this is the very definition of an "uninformed choice". You don't have to be uninformed on purpose; you could also be kept in the dark intentionally by salespeople. You're still making a choice based on an incomplete picture.
Like the blog says, CPU is fairly cheap on the cloud. Memory is fairly expensive and normally good disk IO cannot be purchased (Bytemark's BigV looks different in this regard). We have a service (http://www.mynaweb.com/) that has a bunch of mostly CPU bound services and a few IO bound ones. We also want the CPU bound services to be near to customers. It makes sense for us to acquire cheap cloud instances around the world for the CPU bound services and keep the IO bound stuff on dedicated hosts.
Who in the US provides cloud with disks comparable to dedicated servers? (or otherwise 'good')?
I say this because I run a cloud company called orion in Australia; we produce cloud VMs with faster-than-dedicated disk performance. When we were last in SV pitching, nobody knew of anybody in a similar sort of space.
I'm just asking because you seemed to emphasise 'normally' and 'good'.
I think it depends what you mean by 'beats'. In terms of performance, bare metal will always win, and in a stable situation it can even work out economically.
Where the cloud comes in useful is when you're just starting out and have no idea what kind of demand you should be planning for. Are you going to get swamped? If so, no big deal- spin up a few more machines. If not, you're sitting pretty. Making those kinds of changes with a bare metal setup takes time.
All this said, I am surprised by the number of successful, profitable companies that still use the cloud. Once you know your numbers and have a reasonable outlook for the future you should at least investigate getting some hosting of your own.
Amazon S3 is awesome, and different from infrastructure in the cloud. If you choose to use S3 and a lot of your servers rely on fast access to it, then using EC2 too can often make sense.
Rackspace offers a good compromise where they have cloud services like Cloud Files and Cloud Servers, but you can also have dedicated servers with fast access to your cloud stuff.
As others have mentioned, it’s a lot easier to achieve geographical diversity with a cloud provider like AWS.
Another thing to keep in mind with the AWS cloud: if you have a huge setup, you can always start the largest instances, and you will almost certainly have the physical server all to yourself. And for a fee of $10 per hour per region, across _all_ your instances, you can have dedicated instances, where you're guaranteed that none of your instances will be on shared hardware.
($7,200 per region per month sounds like a lot, but it is a fixed fee and a drop in the bucket for people with huge EC2 deployments.)
Thanks for this post. I currently own/run a niche social networking site with a little over 60,500 registered users and see several thousand users on the site at a time. I bought my own servers and co-locate them in a large hosting facility in northern Virginia. The initial cost was the hardware; of course you can lease it if you want, which can be cheaper if you intend to upgrade existing servers on a yearly basis. We do not need to do that quite so often, so buying them outright has been economical for us.
The ability to customize our boxes has been a big advantage for us, and since the hosting facility has all the redundant power sources and bandwidth pipes, we never see any problems. I will mention that most of our traffic is east coast based, and given that our servers are on the east coast we have not seen any latency issues. If traffic expands we would look to put some boxes on the west coast or midwest.
At one point I looked into switching to the cloud with AWS and Rackspace; the costs were much more than we pay now.
In regards to bandwidth, most of the cloud pricing I have seen is based on total usage; our bandwidth is billed on 95th percentile usage. And it's not capped, so if we have a spike of 20Mb/sec the pipe is open to fulfill it. The 95th percentile pricing model has worked very well for us. We average a few Mb/sec and our bandwidth costs are under $50/month. I'd add to the author's point about negotiating: do it, you can get a great deal (or several).
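For anyone unfamiliar with 95th percentile billing, the mechanics are simple: the provider samples your bandwidth (typically every 5 minutes), throws away the top 5% of samples for the month, and bills you at the highest remaining sample. A minimal sketch (the sample values below are made up to mirror the numbers in this comment):

```python
def percentile_95(samples_mbps):
    """Billable rate under 95th-percentile billing: sort the month's
    bandwidth samples, discard the top 5%, and bill at the highest
    sample that remains."""
    ordered = sorted(samples_mbps)
    index = int(len(ordered) * 0.95) - 1  # last sample inside the kept 95%
    return ordered[max(index, 0)]

# A month of 5-minute samples (~8640): mostly ~3 Mbps, with brief
# spikes to 20 Mbps that make up well under 5% of the samples.
samples = [3.0] * 8500 + [20.0] * 140
rate = percentile_95(samples)
print(rate)  # prints 3.0: the spikes fall entirely in the discarded 5%
```

This is why short bursts are effectively free under this model, while sustained load is what you actually pay for.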
I looked into AWS for another startup I am doing in the communications space and we tried it; for not a lot of users, the cloud was very expensive. We moved to Rackspace and have capped our alpha usage at $100; it's still expensive, and as we move to launch over the next year we will go with dedicated servers.
I run a service that constantly pushes over 90mbps over the wire (about 30TB a month) and I pay just over $100 a month for two servers. The same bandwidth usage on EC2 (or any other 'cloud' provider for that matter) would cost me thousands.
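Those numbers check out on the back of an envelope, using the EC2 egress tiers quoted elsewhere in this thread (first 10TB at $0.12/GB, the next tier at $0.09/GB; treat both as assumptions about the pricing of the time):

```python
def monthly_tb(mbps, days=30):
    """TB (decimal) transferred in a month at a sustained rate in Mbps."""
    bits = mbps * 1e6 * days * 24 * 3600
    return bits / 8 / 1e12

def ec2_egress_cost(tb):
    """EC2 outbound transfer cost, assuming the tiers cited in this
    thread: first 10 TB at $0.12/GB, everything after at $0.09/GB."""
    gb = tb * 1000
    first = min(gb, 10000) * 0.12
    rest = max(gb - 10000, 0) * 0.09
    return first + rest

tb = monthly_tb(90)  # sustained 90 Mbps
print(round(tb, 1), round(ec2_egress_cost(30)))  # prints: 29.2 3000
```

So 90 Mbps sustained really is "about 30TB a month", and at these rates that traffic alone would run roughly $3,000/month on EC2, versus the ~$100 quoted above for two dedicated servers.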
The quality of service and network is also widely different and varied. Hetzner are awesome for the cost, don't get me wrong. But at times their network is terrible and their service seems to vary hugely.
You don't get these problems with Softlayer, you pay significantly more and get significantly better almost-guaranteed service.
If you're looking for tier-1 network traffic, 100tb.com is with Softlayer, and since they have so many servers they get massive discounts and can give you 100TB of Softlayer traffic for relatively cheap.
"Rackspace Cloud has had pretty atrocious uptime; over the years there have been two major outages where half the internet broke. Everyone has their problems, but the main issue is we see really bad node degradation all the time. We’ve had months where a node in our system went down every single week. Fortunately, we’ve always built in the proper redundancy to handle this. We know this will happen with Amazon too from time to time, but we feel more confident about Amazon’s ability to manage it since they also rely on AWS."
There were some statements from Amazon employees that Amazon isn't hosted on AWS.
GAE definitely does not deliver more consistent performance. I've done a ton of performance tweaking on GAE as of late, and my bottlenecks are now reduced to random points in the code between RPCs where my Python thread is obviously locked out of the CPU it was running on.
I should say that the variation this leads to is at max around two seconds. I believe this is due to App Engine doing some dynamic grouping of slow applications. So if your app has fast response times, it will be grouped with other apps having fast response times, so the maximum downside is limited.
If you enable multi-threading for Java, yes. Although they are releasing multithreading with Python 2.7 in 5 days. This statement in their optimization article for the new rules is also interesting:
Multi-threading for Python will not be available until the launch of Python 2.7, which is on our roadmap. In Python 2.7, multithreaded instances can handle more requests at a time and do not have to idly consume Instance Hour quota while waiting for blocking API requests to return. Since Python does not currently support the ability to serve more than one request at a time per instance, and to allow all developers to adjust to concurrent requests, we will be providing a 50% discount on frontend Instance Hours until November 20, 2011. Python 2.7 is currently in the Trusted Tester phase.
I've also noticed my app's speed pick up dramatically in the past few days. Perhaps because people are leaving before the new billing takes effect.
I don't mind the GIL much because I can just make a request to get a new thread going. :)
This is somewhat unrelated, but I remember reading in one of Mixpanel's job posts that they had over 200 servers. 200 for a company of their size that charges by the data point seems like a lot. I've worked at a couple of tech companies that get by with an order of magnitude fewer servers and deal with the same load I bet they deal with. So either they were exaggerating by redefining what a "server" is in the cloud, they have tons of (costly) freeloaders, or their infrastructure is inefficient.
I wouldn't doubt it. But they also don't publish any figures, so it's difficult to confirm. I work for one of their competitors and we most likely have the same availability requirements... anyway, just curious. Here's where we're at, as a comparison: http://bit.ly/qLrKOt
Also interesting is how much of this is simply an artifact of the fact that none of the current "clouds" out there were built to deal with, well, actual loads.
Some things are small, but seem rather strange. Why does no cloud give out 95/5 billing? Why isn't there more resource limiting/etc?
I see a bunch of things leaking out of EC2. People forget that EC2 was designed to deal with large numbers of stateless servers and it's not good for much else. They take the limitations of that and the rest of the AWS platform and apply it to the 'cloud' overall.
Two examples come from the 'variability' section. CPU limiting under Xen (the hypervisor used by both Amazon and Rackspace) is trivial. The fact that CPU is so variable, especially on the smaller tiers, is thus rather interesting.
Similarly with IO: with Rackspace you are on local disks, so unlike Amazon, Rackspace has no defensible reason for letting users starve each other of disk IO.
Also, just as a general data point: there is no real reason a cloud should be in the same order of magnitude of cost as anything you could touch. The reasoning is fairly simple. Everything a cloud provider buys is at massive scale, and there is a very small, mostly fixed management cost to deal with all the hardware. Even at close to list prices, you are still looking at thousands of percent ROI on cloud servers. What that says about the market is that a lack of cloudsmithing know-how has created a temporary monopoly, which is the cause of the current pricing. Over time, I would expect cloud products to simply dominate dedicated and colocated servers for most applications.
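To make the margin reasoning concrete, here is a toy model; every number in it is a guess for illustration, not sourced from any provider's actual costs:

```python
# Hypothetical unit economics of one cloud host. All inputs are
# assumptions chosen only to illustrate the shape of the argument.
hardware_cost = 3000.0      # one server, bought at scale
months = 36                 # amortization period
vms_per_server = 16         # small instances carved out of it
price_per_vm = 60.0         # monthly price per instance
overhead_per_month = 150.0  # power, space, network, staff (guess)

revenue = vms_per_server * price_per_vm * months
cost = hardware_cost + overhead_per_month * months
roi_percent = (revenue - cost) / cost * 100
print(round(roi_percent))  # prints 311
```

Even these conservative guesses yield ROI in the hundreds of percent; push the density or the amortization period and it climbs toward the "thousands of percent" claimed above.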
I use unmanaged dedicated servers with Serverbeach, but I assume SoftLayer has similar tools. If I totally screw up my server there are tools to boot in rescue mode or just wipe the machine and do a clean OS install.
For testing out puppet processes I use Vagrant with VirtualBox on my local machine.
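For what it's worth, a minimal Vagrantfile for that workflow looks roughly like this (Vagrant 1.x syntax; the box name and manifest paths are placeholders for whatever your setup uses):

```ruby
# Minimal Vagrantfile sketch for testing Puppet manifests locally.
# "lucid32" and the manifest paths are assumptions, not fixed names.
Vagrant::Config.run do |config|
  config.vm.box = "lucid32"                 # base box to boot
  config.vm.provision :puppet do |puppet|
    puppet.manifests_path = "manifests"     # directory with your .pp files
    puppet.manifest_file  = "site.pp"       # entry-point manifest
  end
end
```

`vagrant up` then boots the VM in VirtualBox and applies the manifest, and `vagrant destroy` throws the whole thing away so every test starts from a clean machine.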
"We recently added a new backup machine with a crappy CPU, little RAM, and 24 2TB drives in a hardware RAID 6 configuration. You can’t get that from a cloud provider and if you find something similar it’s going to cost an order of magnitude more than what we’re paying."
I tend to agree with his points, but for backups the cloud is perfect. If he had stayed in the cloud, he wouldn't even need the server in question.
Engineering is always about making the right compromises. Why does it have to be 100% dedicated or 100% cloud? What most people don't realize is that it really is a continuum.
For example, why not run the disk performance sensitive DB server on a dedicated machine, while fronting the whole arrangement with proxies and app-servers hosted in the cloud? Ok, so there are latency considerations to be made, but you can see that mixed architectures can make sense.
I think what's stopping people from considering this is that there haven't been good cross-provider network virtualization solutions available. But if you can create your own network topology and your own layer 2 broadcast domains, no matter where your machines are located, things start to look up.
There are a number of network virtualization providers out there now, which you might want to look at to see what's possible. Disclaimer: I work for vCider ( http://vcider.com ), which provides solutions for on-demand virtualized networks, which can span providers and data centers.
The disk IO problem he mentions might start to go away once SSD drives go into wider use by cloud providers. As far as I can tell, that is his main beef. It's certainly a reasonable complaint.
However the point about pricing is less valid. Cloud hosting providers must invest in lots of extra infrastructure to allow for the flexible provisioning they offer, so any comparison that assumes no need for that flexibility is flawed.
Amazon offers spot instances and various other pricing innovations to help align the customer with Amazon's internal provisioning risk.
I could see Amazon offering lower prices if the user commits to longer term provisioning. This is a simple pricing update that would likely negate any cost advantages of non-cloud services.
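A commitment discount of that kind comes down to a simple break-even calculation: pay something upfront for a lower hourly rate, and the commitment wins once usage passes a threshold. The rates below are illustrative guesses, not actual AWS prices:

```python
def break_even_hours(on_demand, upfront, reserved_hourly):
    """Hours of use at which an upfront commitment with a lower hourly
    rate becomes cheaper than plain on-demand pricing."""
    return upfront / (on_demand - reserved_hourly)

# Hypothetical rates: $0.34/hr on-demand vs. $455 upfront + $0.12/hr.
hours = break_even_hours(on_demand=0.34, upfront=455.0, reserved_hourly=0.12)
print(round(hours))  # prints 2068, i.e. under three months of 24/7 use
```

For anything running around the clock, the commitment pays for itself quickly, which is exactly the stable-workload case where dedicated hosting is also attractive.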
The bleeding edge hardware aspect of his argument is valid for some businesses but not likely applicable to most.
Can the multi-tenant problem be solved by just using the beefiest EC2 instance available? At some point don't you become the only one occupying that box? And if your site has the volume that Mixpanel does, I assume you wouldn't be exposing yourself to single-point-of-failure issues because you'd still have many such boxes. Can someone more knowledgeable address this?
Yes, my experience with performance problems on virtual servers is that they're disk related. It's great that you get guaranteed CPU, memory, bandwidth, etc., but if you're getting 3 MB/s disk throughput it doesn't matter; your site will slow to a crawl. I moved away from Slicehost for this reason, and have never had such issues with Linode.
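A crude way to check whether you're being starved like that is a sequential-write test with `dd`; the filename is just a scratch file, and `conv=fdatasync` makes `dd` flush to disk before reporting, so the number reflects the disk rather than the page cache:

```shell
# Rough sequential-write benchmark: writes 256 MB to a scratch file
# and reports throughput in its summary line. conv=fdatasync forces
# the data to disk before dd prints its timing.
dd if=/dev/zero of=ddtest.bin bs=1M count=256 conv=fdatasync
rm -f ddtest.bin
```

Run it a few times at different hours; on an oversubscribed host the reported MB/s will swing wildly, which is the noisy-neighbor effect in action.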
Since their app is highly optimized, profiled, and tweaked low-level C, it's no wonder they could not tolerate the CPU variance of the cloud. Even so, at least on EC2 there are far fewer noisy-neighbor issues on the bigger instances; on extra-large instances, for example, you essentially have the entire server to yourself.
Not to mention that if you _really_ need to, you can spin up cloud instances at Softlayer as well. We run on dedicated hardware most of the time, but if we anticipate a temporary and significant influx of traffic, we spin up a few additional pre-built Cloud Computing Units (as they call them) and are ready in 20 minutes.
"After deciding to go dedicated, the next step is choosing a provider. We got competing quotes from a number of companies. One thing that I was surprised by — and this really doesn’t seem to be the case with the cloud — is that pricing is highly variable and you have to be prepared to negotiate everything. The difference between ordering at face value and either getting a competing quote or simply negotiating down can be as much as 50-75% off. As an engineer, this type of sales process is tiring, but once you have a good feel for what you should be paying and what kind of discount you can reasonably get, the negotiations are pretty quick and painless.
We ultimately decided to go with Softlayer for a number of reasons:
- No contracts. I don’t think I really need to explain the advantage. You would think that you could get better prices by signing 1 or 2 year contracts, but interestingly enough, out of the initial 5 providers we talked to the two that didn’t require contracts had the best prices.
- Wide selection. Softlayer seems to keep machines around for a while and you can get very good deals on last year’s hardware. Most of the other providers we contacted would only provision brand new hardware and you pay a premium.
- Fast deployment. Softlayer isn’t quite at the cloud level for deployment times, but we usually get machines within 2-8 hours or so. That’s good enough for our purposes. On the other hand, a lot of other hosting companies have deployment times measured in days or worse.
One last thing about getting dedicated hardware. It’s cheaper… a lot cheaper. We have machines that give us 2-4x performance that cost less than half as much as their cloud equivalents and we’re not even co-locating (which has its own set of hassles)."
There are some cloud providers where you can allocate dedicated servers with virtualization on top. That way you can manage exactly what runs on each instance while still having the flexibility to allocate more server instances quickly to handle growth.
The disk problem is something I've always been fighting when it comes to virtual private servers. I've never had to do as much optimization on dedicated servers as I have on VPSs.