
Why We Moved Off The Cloud - btmorex
http://code.mixpanel.com/2011/10/27/why-we-moved-off-the-cloud/
======
jbyers
A factor this post doesn't mention is bandwidth cost. If you use a lot of
bandwidth and negotiate competitive hardware pricing, you also save with
dedicated hosts.

Say you have 10 machines at SoftLayer and use 30TB a month. Each machine comes
with 3TB and you pool your bandwidth for $25 per server so you can allocate
the whole 30TB to your proxies. You can't know exactly what fraction of your
server cost covers bandwidth, but you can work out the point where you start
saving.

At Amazon, 30TB of US-East EC2 outbound bandwidth costs 10,000 x $0.12 +
20,000 x $0.09 = $3,000.

If you estimate the bandwidth portion of your SoftLayer server cost at less
than $275 you're saving money when using your full bandwidth allocation. With
servers starting at $159, sub-$100 seems realistic.

In our case with dozens of servers and ~60 TB of bandwidth, we're saving
thousands a month compared to EC2.
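
Spelled out as a quick script (a sketch using only the figures already quoted
above):

    # Breakeven sketch: EC2 egress tiers vs. pooled SoftLayer bandwidth.
    def ec2_egress_cost(gb):
        # First 10,000 GB at $0.12/GB, the remainder at $0.09/GB (as above).
        return min(gb, 10000) * 0.12 + max(gb - 10000, 0) * 0.09
    
    servers = 10
    pooling = 25 * servers                    # $25/server to pool bandwidth
    ec2 = ec2_egress_cost(30000)              # $3,000 for 30TB
    
    # Per-server bandwidth share below which dedicated wins:
    print(ec2, (ec2 - pooling) / servers)     # 3000.0 275.0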

~~~
benologist
Yeah, bandwidth can be insanely expensive on the cloud. Amazon at least can
mitigate that if you're exclusively on their network, so it's all inbound or
across-their-network traffic, but as soon as you start pushing terabytes off
their network you're going to feel it.

Dedicated servers with massive bandwidth plans are very easy to come by, on
top of the perks of dedicated I/O and RAID (which also start very cheap).

~~~
herbivore
With many (if not most) reputable dedicated server providers nowadays,
inbound traffic doesn't count toward your overall traffic. In other words,
it's free and generally unlimited.

~~~
benologist
I did not know that. My provider counts inbound, which sucks because that's
most of our traffic, but we run at about 6 out of 10 terabytes per server per
month, so it's not a problem... for now.

~~~
dhimes
It double-sucks because it's not easy (AFAIK) to gzip traffic from the web
app to the server.

------
tzury
Facebook's CTO thinks the opposite, or perhaps the two views actually line
up.

IMHO, a startup should not _start_ with dedicated, but once you get to a
certain size, dedicated hardware, team, and bandwidth may well be the way to
go.

Just as you would not hire a dedicated chef to cook your meals at the start;
you would outsource meals instead.

\-- <http://www.bbc.co.uk/news/business-12406171>

    
    
      What's the biggest technology mistake you've ever made - either 
      at work or in your own life?
    
      Prior to Facebook, I was the chief executive of a small internet 
      startup called FriendFeed.
    
      When we started that company, we were faced with deciding whether 
      to purchase our own servers, or use one of the many cloud hosting 
      providers out there like Amazon Web Services.
    
      At the time we chose to purchase our own servers. I think that was 
      a big mistake in retrospect. The reason for that is despite the 
      fact it cost much less in terms of dollars spent to purchase our 
      own, it meant we had to maintain them ourselves, and there were 
      times where I'd have to wake up in the middle of the night and 
      drive down to a data centre to fix a problem.

~~~
z2amiller
I've always thought the "drive to the datacentre" argument was BS. If you're
writing your app for the cloud, you have to deal with instances spuriously
going away, degrading, etc. It is no different in the datacenter. If you're
driving to the datacenter in the middle of the night to replace a disk or a
fan, you're doing it just as wrong as if getting evicted from an EC2 instance
forces you to scramble oncall resources.

In my experience, the highest operational cost with running services is
managing the application itself - deployment, scaling, and troubleshooting.
None of that goes away with the cloud.

~~~
necro
I have to agree. I put our stuff in a colo 2 years ago and never looked back.
Pretty much all servers come with some kind of remote console interface
(IPMI), and that's not just terminal redirection; it's actually a totally
self-contained microprocessor and Ethernet port that you can run on a
separate subnet to control your server even if it's off. I've updated the
BIOS and reinstalled OSes, all via IPMI, which is part of the motherboard.
Add to that power strips that you can also control remotely and you're all
set. Our servers are in the Bay Area; I'm in Canada. I have NEVER had to
drive/fly to fix anything. Never even had to use remote hands for anything.
Sure, some drives died, but standby drives are in place.

The costs are dirt cheap these days. You can get a full rack, power, and a
gigabit feed for about $800 in many colos in Texas. We opted for Equinix in
San Jose, which is all fancy with work areas, meeting rooms, etc. when you
are there, but the funny part is, we're never there!

I do like virtualization for some maintenance/flexibility, so we have a few
servers that act as hosts and we run our own private cloud where we get to
decide where/what runs. In other cases, database servers run on bare metal
with SSD drives. Best of both worlds.

It's so cheap you can get a second colo in a different part of the country to
house a second copy of your backups, and some redundant systems just in case
something really bad happens.

Oh yeah, and don't get me started on storage. We store about 100TB of data.
How much is that on S3 per month? $12,000/month! A fancy enterprise storage
system pays for itself every couple of months of S3 fees.
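
A minimal payback sketch; the hardware price below is a made-up placeholder,
and the ~$0.12/GB-month rate is simply what the $12,000 figure implies:

    # S3 monthly bill vs. a one-time storage purchase.
    storage_gb = 100 * 1000                   # ~100TB
    s3_monthly = storage_gb * 0.12            # = $12,000/month (rate implied above)
    
    hardware = 25000                          # hypothetical enterprise array price
    print(s3_monthly, hardware / s3_monthly)  # 12000.0 ~2.1 months to payback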

~~~
Huppie
_I have NEVER had to drive/fly to fix anything. Never even had to use remote
hands for anything. Sure some drives died, but standby drives are in place._

Consider yourself lucky. We thought the same thing, but when a RAID controller
died on us recently we really didn't know what hit us. It didn't just stop
working, it started by hanging the server every now and then, then after a day
slowly corrupting drives, then after a day or two it stopped completely.

~~~
necro
I'm a bit conservative when it comes to hardware like RAID controllers. My
choice was 3ware. They are by no means the fastest; in fact, the performance
sucks compared to others. I went to a company that builds storage systems,
but will build any kind you want, not locked into any controller. I trusted
them when they recommended the one that, in their experience, gets
returned/fails the least. Of course everything fails, so it's just a matter
of time. We have triple-redundant storage for file backup: active, a
5-minute-behind backup that is ready to be swapped in with one click, and
long-term. If something goes wrong with the active set or it slows down, we
just flip a switch and all our app servers use the new system, which is at
most 5 minutes behind. The old system gets shot in the head and can be
diagnosed offline. Shoot first, ask questions later.

------
shaggy
The cloud became the new hot thing and lots and lots of people, sites, and
enterprises jumped on board. Need to quickly deploy code without any
understanding of how infrastructure works? Great, the cloud solves your
problem! Who needs experienced infrastructure people anyway? Need to quickly
respond to a spike in traffic? Great, the cloud solves your problem! Need
guaranteed SLAs and reliable CPU, memory, and disk performance? Yeah, good
luck with that. At any sort of reasonable scale, the cloud is almost always
more expensive than doing it yourself (even with VMs). Cloud computing isn't
completely bad, but it's not the panacea that too many people make it out to
be.

~~~
gpapilion
When I did a comparison of co-located vs. managed vs. cloud, my numbers broke
down as such:

    co-lo: 1x cost
    managed: 1.5-2x cost
    cloud: 2.25-4x cost

This was with headcount changes figured into the pricing. (We did not see a
headcount reduction when using EC2.)

The primary advantage the cloud offered was that it was an operating expense
without a contract, and that you could turn systems off when not used.

------
larrys
"You would think that you could get better prices by signing 1 or 2 year
contracts, but interestingly enough, out of the initial 5 providers we talked
to the two that didn’t require contracts had the best prices." (snip) "We’ve
moved 100% of our machines that rely upon performant (sic) disks to dedicated
servers hosted at Softlayer. Roughly speaking, this corresponds to about 80%
of our hosting costs. Eventually, we’ll move everything "

If you don't have a contract there is nothing to prevent a provider from
raising prices on you. The reason to have a contract is _not just_ to get the
best price. It's to have a price guarantee. Edit: Moving 100% of your
machines again won't be something you'll want to do. If you have a contract
you can renegotiate well in advance of any price increase.

Prices always drop?

People thought housing prices always go up as well.

How long is the price guaranteed for? I'm assuming Mixpanel has this issue
covered, but it's important to keep in mind. Not having a contract goes both
ways.

~~~
krobertson
There are other aspects of a contract that can be beneficial:

An SLA, including compensation for outages and outs if they have too many.
Sure, without a contract you can leave anytime, but a contract isn't
necessarily a permanent trap. They are negotiable. You can push for lower
incident allowances, an opt-out partway through the contract, and so on.

Support, in the sense of actual human assistance with the move and with
issues. Depending on your size and the terms, that contract could be worth 6
to 7 figures to the company. That is some serious motivation to make the
initial experience good and to help along the way.

~~~
biot
Generally "no contract" means no fixed-term (eg: multi-year) contract. There
is always a contract of some sort covering acceptable use, SLA, payment terms,
termination rights, and so on. Otherwise you could use as much as you want for
any purpose and not pay them a cent and they'd have no recourse. You can even
have a contract locking in your price for five years but that you only pay
month to month and can leave at any time without penalty.

------
jedberg
You didn't abandon "the cloud", you just switched providers.

You're still paying someone else for servers that you don't own (unless
softlayer ships you those machines after 3 years).

This is why I hate the term "the cloud" -- because it is too nebulous and
non-descriptive.

~~~
papercruncher
Cloud implies running in a virtualized environment. Dedicated implies that
only your bits run on that hardware, which is _huge_ for I/O.

~~~
jedberg
Also, you don't necessarily take a huge hit in I/O with virtualization.
VMware on dedicated hardware with all the virtualization extensions will be
pretty comparable and a lot easier to maintain than straight-up raw iron.

~~~
spydum
To be fair, this is only true if the storage back end is appropriately
configured.

You _will_ take an I/O hit when instead of a single physical machine asking
for a set of sequential blocks off the disks, you have 20 virtual machines
asking for seemingly random blocks off the disks.

Replace disks with a storage array if you'd like, but the fact remains: more
VMs will mean more storage contention. If you have the funds for dedicated
arrays per VM, hats off to you. Most people never do this, and I/O suffers a
penalty. Virtualization has its price, and even that being said, I think it's
worth it for most people.

~~~
asharp
Your point about multiple VMs creating additional seeks is correct, but it
misses the bigger picture.

Basically, on any reasonably sophisticated hosting infrastructure those
aren't a problem.

If your problem is just the sequential-to-random conversion due to additional
VMs, then bcache/flashcache does a surprisingly good job of making that just
plain go away at little additional cost.

On any reasonably sophisticated host you have a distributed SAN, which
benefits from a cool little stats trick: as you add more VMs together, the
variance of the total I/O load drops, and the load pattern itself becomes
more and more normal, the larger and more uncorrelated you get.

That gives each VM more 'burst' capacity when required, with many fewer
failures (i.e. the VM asking for more I/O than the current capacity of the
system).
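
A toy simulation of that stats trick, assuming uncorrelated, exponentially
distributed I/O demand per VM (purely illustrative numbers):

    # Aggregate I/O demand from N uncorrelated VMs: the peak-to-mean ratio
    # of the total shrinks as N grows, leaving more shared burst headroom.
    import random
    
    def peak_to_mean(n_vms, samples=5000):
        totals = [sum(random.expovariate(1.0) for _ in range(n_vms))
                  for _ in range(samples)]
        return max(totals) / (sum(totals) / samples)
    
    for n in (1, 10, 100):
        print(n, round(peak_to_mean(n), 2))
    # Roughly: a lone VM bursts to ~9x its mean; 100 pooled VMs only to ~1.4x.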

This leads to a bunch of interesting stuff when you try to apply it in real
world systems, either in HPC or in clouds.

------
steve8918
I think currently the biggest problem with the cloud is the inability of
cloud providers to truly estimate and understand their risk, which also means
that customers don't have the ability to understand their risks either.

For example, Amazon could estimate 99.95% uptime because of physical and
geographical redundancy, etc. But this analysis would be faulty, as their
outage earlier this year showed.

There is a litany of long-tail black swan events that could bring down entire
datacenters that people just can't anticipate. Not even including
earthquakes, terrorist attacks, etc., but even simple upgrades or
misconfigurations like the one that took down their East Coast datacenter.
Yet they still advertise an SLA of 99.95% availability. Is the risk of
downtime really only 0.05%? Was the event that occurred really a
3-standard-deviation event? I highly doubt it.

This complete lack of any true ability to estimate risk means that customers
also have an essentially inaccurate view of what their risks are. Like the
commenter who said that a small business ran their POS device over the cloud:
if you told them they would be down 2 days out of the year, would they really
be interested in that? Probably not.

In a similar vein, the authors were likely promised great uptime, but no
guarantees on I/O or CPU performance, which is something you don't think to
ask about. The cloud provider doesn't have to be down for your web service to
be drastically affected. I suppose since this is all new, the customers are
learning which questions to ask and the cloud providers are learning which
things to guarantee, so hopefully this gets worked out in the next year or
so.

~~~
scottm01
While I agree with you that providers have obviously been unable to avoid
"long-tail black swan events" (awesome phrase!), and that too many businesses
and users jump to the cloud without actually understanding their
architecture, availability does not mean what you are implying.

99.95% availability means your site should be "available" 99.95% of the time.
It does not mean you have a 0.05% chance of a disaster; it means you will not
have more than about 21.6 minutes per month of outages. Those 21.6 minutes
might be during your most critical time. They might even all be added
together into one 4-hour downtime right before you're demoing to VCs and
still not violate your SLA for the year.
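
For concreteness, the downtime budget an availability figure implies
(assuming a 30-day month):

    # Convert an availability SLA into an outage budget.
    def downtime_minutes(availability, days=30):
        return days * 24 * 60 * (1 - availability)
    
    print(round(downtime_minutes(0.9995), 1))                 # 21.6 minutes/month
    print(round(downtime_minutes(0.9995, days=365) / 60, 1))  # ~4.4 hours/year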

------
ridruejo
Thanks for sharing the experience. In your particular case it seems like the
right decision to move to a dedicated provider. The reasoning was similar to
GitHub's (<https://github.com/blog/493-github-is-moving-to-rackspace>). Cloud
environments make the most sense at either end of the spectrum. It works if
you are Netflix, where the choice is not whether to purchase a few servers
but whether to build your own cloud infrastructure
([http://perfcap.blogspot.com/2011/08/i-come-to-
use-clouds-not...](http://perfcap.blogspot.com/2011/08/i-come-to-use-clouds-
not-to-build-them.html)), or if you are just getting started, when you are
cash-constrained, there are a lot of uncertainties, and the cloud lets you
run quick experiments. Take our case: we provide a cloud hosting tool
(<http://bitnami.org/cloud>) but we don't run all of our systems there; they
are divided between "traditional" providers (one of them Softlayer) and AWS.
It is a bit of a pain, but for our current requirements and budget, it works
nicely.

------
jwegan
Rackspace's primary business is dedicated hosting. In fact they let you have
both dedicated hosts and cloud hosts and allow dedicated and cloud hosts to
talk to each other.

I'm curious why they didn't just switch to Rackspace's dedicated hosting. It
would have given them the performance they needed while retaining the
flexibility of being able to quickly spin up cloud machines in the same
datacenter as the dedicated machines.

~~~
powertower
> In fact they let you have both dedicated hosts and cloud hosts and allow
> dedicated and cloud hosts to talk to each other.

That's not amazing at all.

You do realise that a "cloud" host and a dedicated box are the exact same
hardware sitting next to each other in a rack? One's just virtualized 10x
with Xen, VMware, KVM, etc.

~~~
jwegan
Of course it's not amazing, but it is a feature they have that would have
made the transition to dedicated easier.

~~~
powertower
Being networked is a feature?

~~~
jwegan
Being in the same datacenter is a feature, since it means there is extremely
low latency between your dedicated and cloud machines. Sub-millisecond vs.
tens-of-milliseconds round-trip times can make a huge difference.

------
Loic
If you are running VMs on top of your dedicated servers, providers now all
offer a dedicated VLAN for your servers. This lets you deploy your own VM
management software on top of your dedicated hardware.

If you can, I recommend using Ganeti with Xen or KVM (I use KVM). Rigorous
development, very friendly developers, and very well-designed tools. No
wonder it is used internally at Google.

<http://code.google.com/p/ganeti/> \- Project page.

<http://notes.ceondo.com/ganeti/> \- Notes on how to use it with Debian
(long).

~~~
adgar
> No wonder it is used internally at Google.

Well, I think that's more because it was written at Google.

~~~
Loic
Yes, but if it were a dead horse, they would not have kept it in production
and kept improving it over the past 5 years. I think this is the key: well
designed, well maintained, and used for critical stuff in a big company, all
of that over several years.

Edit: forgot part of the sentence, stupid me.

------
ryanlchan
Do you rent your house, or did you buy it?

Cloud is renting servers: low cap-ex but high op-ex, minimal risk exposure,
highly nimble. Dedicated hardware is buying servers: high cap-ex but low
op-ex, more risk, and more consistent.

There's nothing inherently "better" in either strategy; they each suit a
different need.

~~~
kondro
Except that in many places in Australia renting is actually cheaper than
buying, because poorly educated investors have been sucked into the dream of
financial freedom through owning an asset that _always_ increases in value.

------
kqueue
This service provider has great pricing compared to Softlayer:

[http://www.hetzner.de/en/hosting/produktmatrix/rootserver-
pr...](http://www.hetzner.de/en/hosting/produktmatrix/rootserver-
produktmatrix-ex)

Intel Core i7-2600 quad core + 16GB DDR3 + 2 x 3TB 7200rpm for 49 euro.

~~~
AdamGibbins
The quality of service and network also varies widely. Hetzner are awesome
for the cost, don't get me wrong, but at times their network is terrible and
their service seems to vary hugely.

You don't get these problems with Softlayer: you pay significantly more and
get significantly better, almost-guaranteed service.

~~~
kqueue
I see. I haven't tried them personally; my friend has been using them for a
year now and so far so good. He did complain about their support, though.

~~~
james33
We've been using 100TB exclusively for over 2 years now with absolutely no
problems. Their support has been top-notch in my opinion.

------
bryanh
Just about everyone and their mom hopped onto the cloud bandwagon there for a
good while, but with this steady onslaught of praise for bare-metal hosting,
maybe that will reverse a little.

It seems like the only thing the cloud really does best is:

    
    
        1) Short-lived instances or "now" instances.
        2) and... what?
    

I'm trying to think of other situations where cloud beats bare metal, but I
am coming up short.

~~~
noodle
> I'm trying to think of other situations where cloud beats bare metal, but I
> am coming up short.

smaller businesses where economies of scale don't kick in, and/or smaller
businesses that want to hedge their bets on growth.

~~~
crag
Except of course when that small business relies too much on the cloud. I
know a local gym that decided to "host" their cash register. I told the owner
it was a mistake. He basically said I was stuck in the past.

Until his Comcast connection went down or slowed down.

Or a local non-profit whose board came up with a great way to save money:
host their phone system in the cloud. A local carrier was happy to sign them
to a 3-year contract. Even supplied the 42 phones. Now they're lucky if they
can make calls midday. It's so bad that if 10 phones are in use, the next
call will sound like you're calling from a wind tunnel. And forget about
calling at peak times. What does the carrier suggest? Upgrading to a T1. Of
course the carrier never mentioned this when selling the service in the first
place. And personally, with 42 phones plus 50+ computers and other devices,
I'm suggesting a T3 (cost down here is about $500-600 a month).

My point is, our infrastructure (at least in South Florida) isn't there yet.
Sure the cloud is a great idea. But if you can't reach it, it's useless. But
that doesn't stop the marketing. Or the complaints.

~~~
noodle
those are examples of people making (or being sold on) poor choices and/or not
having all information necessary to make an informed choice.

the cloud is not for everyone. physical servers aren't for everyone, either.
stories like these don't automatically imply that using the cloud is a bad
idea for everyone.

~~~
crag
"informed choice"

No, no. It's about price.

Most small businesses subscribing to these services aren't technical. In the
phone market (as an example), carriers are selling hosted "solutions" for
less than $75 a month. Comcast is too. I like Comcast. But you can't run an
office with 25 phones and PCs and other devices on Comcast. At least not in
Florida.

The other problem is that most IT firms down here are pushing their own
"hosted solutions". Everything from email to accounting services. For cheap.

Now let me be clear: some services I think make perfect sense in the cloud,
even with unreliable connectivity. Like email, storage, messaging. But your
core business, the services you must have to run your business, needs to
remain under your control. Period.

And the last thing: many small businesses don't really understand just how
important IT is to their business.

~~~
noodle
if it was purely about price, then these people would've purchased these
options regardless of the shortcomings if they had known about them
beforehand. and if this is the case, they went in with eyes wide open and the
cloud _is_ right for them.

~~~
crag
I doubt that. Trust me when I tell you this: the carriers' (and I'm including
Comcast's) sales people do not tell the customer the downsides and
limitations.

If you don't believe me, try it. Call the business sales units of the
carriers.

Now I believe in buyer beware. But that's the problem... most small
businesses are suffering cash flow problems. When something cheaper comes
along there is no "buyer beware". They think about lowering their monthly
bills. And of course, in the end they get bitten in the ass.

It's just mind-boggling to me how many business owners are so ignorant about
the tech that runs their business.

~~~
noodle
> It's just mind-boggling to me how many business owners are so ignorant
> about the tech that runs their business.

in my mind, this is the very definition of "uninformed choice". you don't have
to be uninformed on purpose, you could also be kept in the dark intentionally
by salespeople. you're still making a choice based on an incomplete picture.

------
BrandonCWhite
Thanks for this post. I currently own/run a niche social networking site with
a little over 60,500 registered users and several thousand users on the site
at a time. I bought my own servers and co-locate them in a large hosting
facility in northern Virginia. The initial cost was the hardware; of course,
you can lease it if you want, which can be cheaper if you intend to upgrade
existing servers on a yearly basis. We do not need to do that quite that
often, so buying them outright has been economical for us.

The ability to customize our boxes has been a big advantage for us, and given
that the hosting facility has all the redundant power sources and bandwidth
pipes, we never see any problems. I will mention that most of our traffic is
east coast based, and given that our servers are on the east coast, we have
not seen any problems. If traffic expands we would look to put some boxes on
the west coast or midwest.

At one point I looked into switching to the cloud with AWS and Rackspace; the
costs were much more than we pay now.

In regards to bandwidth, most of the cloud pricing I have seen is based on
total usage; our bandwidth is billed on 95th-percentile usage. And it's not
capped, so if we have a spike of 20Mb/sec the pipe is open to fulfill it. The
95th-percentile pricing model has worked very well for us. We average a few
Mb/sec and our bandwidth costs are under $50/month. I'd add, when the author
talks about negotiating: do it, you can get a great deal (or several).
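
For anyone unfamiliar with 95th-percentile billing, the mechanics are roughly
this (a sketch, not any particular carrier's exact formula):

    # 95th-percentile billing: sample throughput every 5 minutes, discard
    # the top 5% of samples, and bill at the highest remaining rate.
    def billable_mbps(samples):
        ordered = sorted(samples)
        return ordered[int(len(ordered) * 0.95) - 1]
    
    # A 30-day month is ~8,640 five-minute samples, so roughly 432 samples
    # (about 36 hours) of spikes can exceed the billed rate at no extra cost.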

I looked into AWS for another startup I am doing in the communications space,
and we tried it; for not a lot of users, the cloud was very expensive. We
moved to Rackspace and have limited our alpha users to $100; it's still
expensive, and as we move to launch over the next year we will go with
dedicated servers.

Thanks for the post. Brandon

~~~
charliesome
Yes, bandwidth is dirt cheap on a dedicated box.

I run a service that constantly pushes over 90Mbps over the wire (about 30TB
a month) and I pay just over $100 a month for two servers. The same bandwidth
usage on EC2 (or any other 'cloud' provider, for that matter) would cost me
thousands.

~~~
blantonl
100tb.com?

------
Joakal
Interesting post on leaving Rackspace:

"Rackspace Cloud has had pretty atrocious uptime over the year there has been
two major outages where half the internet broke. Everyone has their problems
but the main issue is we see really bad node degradation all the time. We’ve
had months where a node in our system went down every single week.
Fortunately, we’ve always built in the proper redundancy to handle this. We
know this will happen Amazon too from time to time but _we feel more confident
about Amazon’s ability to manage this since they also rely on AWS._ "

There have been statements from Amazon employees that Amazon isn't hosted on
AWS.

~~~
asharp
Amazon.com was only recently transitioned, but AWS was used to host ecommerce
apps for partners, iirc. before it's public debeau. This is what it was built
for.

------
proofpeer
I think you have to distinguish between models where you rent whole servers
and more managed models like Google App Engine. It seems to me that GAE
should deliver more consistent performance.

~~~
cr4zy
GAE definitely does not deliver more consistent performance. I've done a ton
of performance tweaking on GAE as of late, and my bottlenecks are now reduced
to random points in the code between RPCs where my Python thread is obviously
locked out of the CPU it was running on.

I should say that the variation this leads to is at max around two seconds. I
believe this is due to App Engine doing some dynamic grouping of slow
applications. So if your app has fast response times, it will be grouped with
other apps having fast response times, so the maximum downside is limited.

~~~
proofpeer
That's interesting. Do you think you get better performance by using Java
instead of Python? I heard bad things about Python and the GIL.

~~~
cr4zy
If you enable multi-threading for Java, yes, although they are releasing
multithreading with Python 2.7 in 5 days. This statement in their
optimization article for the new rules is also interesting:

 _Multi-threading for Python will not be available until the launch of Python
2.7, which is on our roadmap. In Python 2.7, multithreaded instances can
handle more requests at a time and do not have to idly consume Instance Hour
quota while waiting for blocking API requests to return. Since Python does not
currently support the ability to serve more than one request at a time per
instance, and to allow all developers to adjust to concurrent requests, we
will be providing a 50% discount on frontend Instance Hours until November 20,
2011. The Python 2.7 is currently in the Trusted Tester phase._

I've also noticed my app's speed pick up dramatically in the past few days.
Perhaps because people are leaving before the new billing takes effect.

I don't mind the GIL much because I can just make a request to get a new
thread going. :)

------
physcab
This is somewhat unrelated, but I remember reading in one of Mixpanel's job
posts that they had over 200 servers. 200 for a company of their size that
charges by the data point seems like a lot. I've worked at a couple of tech
companies that get by with an order of magnitude fewer servers and deal with
the same load that I bet they deal with. So either they were exaggerating by
redefining what a "server" is in the cloud, they have tons of (costly)
freeloaders, or their infrastructure is inefficient.

~~~
jbyers
Or you bet wrong about the load they deal with.

They may also have higher availability requirements than most companies and
need 2X (more?) the infrastructure to protect against a data collection
failure.

They may be counting nodes used periodically, e.g. a large Hadoop map-reduce
run.

Edit: don't get me wrong -- 200 servers is a lot. :)

~~~
physcab
I wouldn't doubt it. But they also don't publish any figures, so it's
difficult to confirm. I work for one of their competitors and we most likely
have the same availability requirements... anyways, just curious. Here's
where we're at, as a comparison: <http://bit.ly/qLrKOt>

edit: looks like they did publish some figures :)
[http://techcrunch.com/2010/07/01/mixpanel-billion-
datapoints...](http://techcrunch.com/2010/07/01/mixpanel-billion-datapoints/)

------
asharp
Interesting.

It's also interesting what is simply an artifact of the fact that none of the
current "clouds" out there were built to deal with, well, actual loads.

Some things are small but seem rather strange. Why does no cloud offer 95/5
billing? Why isn't there more resource limiting, etc.?

I see a bunch of things leaking out of EC2. People forget that EC2 was
designed to deal with large numbers of stateless servers and isn't good for
much else. They take the limitations of it and the rest of the AWS platform
and apply them to the 'cloud' overall.

Two examples come from the 'variability' section. CPU limiting under Xen (the
hypervisor used by both Amazon and Rackspace) is trivial. The fact that CPU
is so variable, especially on the smaller tiers, is thus rather interesting.

Similarly with I/O. With Rackspace you are on local disks, so, unlike Amazon,
Rackspace has no defensible reason for letting users starve each other of
disk I/O.

Also, just as a general data point: there is no real reason why a cloud
should be in the same order of magnitude of cost as anything you could touch.
The reasoning is fairly simple: everything the providers buy is at massive
scale, and there is a very minimal, fixed-ish management cost to deal with
all the hardware. You can work out that, even at almost list prices, they are
looking at thousands of percent ROI on cloud servers. What that says about
the market is that there is a current monopoly, due to a lack of
cloudsmithing know-how, which is the cause of the current situation. Over
time, I would expect cloud products to simply dominate standard dedicated or
colocated servers for most applications.

------
joevandyk
One thing I love about EC2 is that I can easily test/debug/modify my
provisioning recipes (written in Chef) from a blank slate.

With fog (<http://fog.io/1.0.0/index.html>), I can start up a new EC2
instance in less than a minute and tell it to run the Chef process. If it
doesn't work properly, I shut it down and try again.
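
fog is Ruby; a rough Python equivalent of the same throwaway-instance
workflow with boto might look like this (the AMI ID, key pair, and bootstrap
script are hypothetical):

    # Launch a scratch EC2 instance, bootstrap chef-client via user-data,
    # and terminate it if the recipes misbehave.
    import boto.ec2
    
    conn = boto.ec2.connect_to_region('us-east-1')   # credentials from env/config
    
    bootstrap = """#!/bin/bash
    # hypothetical first-boot script: install chef-client and converge once
    curl -L https://www.opscode.com/chef/install.sh | bash
    chef-client
    """
    
    res = conn.run_instances('ami-12345678',         # hypothetical AMI ID
                             instance_type='m1.small',
                             key_name='my-key',      # hypothetical key pair
                             user_data=bootstrap)
    instance = res.instances[0]
    # ...inspect the run; if it didn't work, throw the instance away and retry:
    instance.terminate()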

How does that work on a dedicated machine at, say, SoftLayer?

~~~
thomaspaine
I use unmanaged dedicated servers with ServerBeach, but I assume SoftLayer
has similar tools. If I totally screw up my server, there are tools to boot
into rescue mode or just wipe the machine and do a clean OS install.

For testing out Puppet processes I use Vagrant with VirtualBox on my local
machine.

------
MichaelApproved
_"We recently added a new backup machine with a crappy CPU, little RAM, and 24
2TB drives in a hardware RAID 6 configuration. You can’t get that from a cloud
provider and if you find something similar it’s going to cost an order of
magnitude more than what we’re paying."_

I tend to agree with his points, but for backups the cloud is perfect. If he
had stayed in the cloud, he wouldn't even need the server in question.

~~~
omfg
That's assuming the per-GB monthly charge works out cheaper than the up-front
investment in a machine like that, which these days is rather small.

------
jbrendel
Engineering is always about making the right compromises. Why does it have to
be 100% dedicated or 100% cloud? What most people don't realize is that it
really is a continuum.

For example, why not run the disk-performance-sensitive DB server on a
dedicated machine, while fronting the whole arrangement with proxies and
app-servers hosted in the cloud? OK, so there are latency considerations to
be made, but you can see that mixed architectures can make sense.

I think what's stopping people from considering this is that there haven't
been good cross-provider network virtualization solutions available. But if
you could create your own network topology and your own layer 2 broadcast
domains, no matter where your machines are located, things are starting to
look up.

There are a number of network virtualization providers out there now, which
you might want to look at to see what's possible. Disclaimer: I work for
vCider ( <http://vcider.com> ), which provides solutions for on-demand
virtualized networks, which can span providers and data centers.

~~~
raylu
"For example, why not run the disk performance sensitive DB server on a
dedicated machine, while fronting the whole arrangement with proxies and app-
servers hosted in the cloud?"

That's actually exactly what we do:

"We’ve moved 100% of our machines that rely upon performant disks to dedicated
servers hosted at Softlayer. Roughly speaking, this corresponds to about 80%
of our hosting costs."

------
endeavor
I'm surprised no one has mentioned AWS dedicated instances:
<http://aws.amazon.com/dedicated-instances/>

20% more expensive but it seems like the easiest way to fix the problem if
you're already on Amazon.

That said, I totally agree that Cloud-based IaaS is not a good fit for every
situation.

~~~
lfittl
Don't forget the additional $10/hour "region fee" (= $7200/month)

And that dedicated instance is likely still running inside Xen, so you got the
normal virtualization overhead, and slow disks.

------
grandalf
The disk I/O problem he mentions might start to go away once SSDs come into
wider use by cloud providers. As far as I can tell, that is his main beef.
It's certainly a reasonable complaint.

However the point about pricing is less valid. Cloud hosting providers must
invest in lots of extra infrastructure to allow for the flexible provisioning
they offer, so any comparison that assumes no need for that flexibility is
flawed.

Amazon offers spot instances and various other pricing innovations to help
align the customer with Amazon's internal provisioning risk.

I could see Amazon offering lower prices if the user commits to longer term
provisioning. This is a simple pricing update that would likely negate any
cost advantages of non-cloud services.

The bleeding edge hardware aspect of his argument is valid for some businesses
but not likely applicable to most.

------
w1ntermute
If this trend picks up, there's a good business opportunity if someone comes
up with a way to combine the best of the cloud with the best of dedicated
hosting.

~~~
oinksoft
Like Linode?

------
tabizzle
Can the multi-tenant problem be solved by just using the beefiest EC2 instance
available? At some point don't you become the only one occupying that box? And
if your site has the volume that MixPanel does, I assume you wouldn't be
exposing yourself to single-point-of-failure issues because you'd still have
many such boxes. Can someone more knowledgeable please address these?

~~~
fleitz
It could, but it can also be solved reliably by buying your own hardware
tailored to the needs of your specific application.

Single point of failure issues aren't solved by the cloud. Their solved by
eliminating single points of failure. 1 VM is just as much of a single point
of failure as one real machine.

Even if it were true about buying the beefiest VM you're betting your company
on an implementation detail that you have no control over.

------
code_duck
Yes, my experience with performance problems on virtual servers is that they
are disk related. It's great that you get guaranteed CPU, memory, bandwidth,
etc., but if you're getting 3 MB/s disk throughput, it doesn't matter: your
site will slow to a crawl. I moved away from Slicehost for this reason, and
have never had such issues with Linode.

------
snorkel
Since their app is highly optimized, profiled, and tweaked low-level C, it's
no wonder they could not tolerate the CPU variance of the cloud. Even so, at
least on EC2 there are far fewer noisy-neighbor issues on the bigger CPU
instances; for example, on extra-large instances you essentially have the
entire server to yourself.
------
hasanove
Not to mention that if you _really_ need to, you can spin up cloud instances
at Softlayer as well. We run on dedicated hardware most of the time, but if
we anticipate a temporary and significant influx of traffic, we bring up a
few additional pre-built Cloud Computing Units (as they call them) and are
ready in 20 minutes.

------
SkyMarshal
Good writeup. Key part (lessons learned):

 _"After deciding to go dedicated, the next step is choosing a provider. We
got competing quotes from a number of companies. One thing that I was
surprised by — and this really doesn’t seem to be the case with the cloud — is
that pricing is highly variable and you have to be prepared to negotiate
everything. The difference between ordering at face value and either getting a
competing quote or simply negotiating down can be as much as 50-75% off. As an
engineer, this type of sales process is tiring, but once you have a good feel
for what you should be paying and what kind of discount you can reasonably
get, the negotiations are pretty quick and painless.

We ultimately decided to go with Softlayer for a number of reasons:

\- No contracts. I don’t think I really need to explain the advantage. You
would think that you could get better prices by signing 1 or 2 year contracts,
but interestingly enough, out of the initial 5 providers we talked to the two
that didn’t require contracts had the best prices.

\- Wide selection. Softlayer seems to keep machines around for a while and you
can get very good deals on last year’s hardware. Most of the other providers
we contacted would only provision brand new hardware and you pay a premium.

\- Fast deployment. Softlayer isn’t quite at the cloud level for deployment
times, but we usually get machines within 2-8 hours or so. That’s good enough
for our purposes. On the other hand, a lot of other hosting companies have
deployment times measured in days or worse.

One last thing about getting dedicated hardware. It’s cheaper… a lot cheaper.
We have machines that give us 2-4x performance that cost less than half as
much as their cloud equivalents and we’re not even co-locating (which has its
own set of hassles)."_

------
traveldotto1
There are some cloud providers where you can allocate dedicated servers with
virtualization on top. That way you can manage exactly what runs on each
instance, while still having the flexibility to allocate more server
instances quickly to handle growth.

------
joshaidan
The disk problem is something I've always been fighting when it comes to
virtual private servers. I never had to do as much optimization with
dedicated servers as I have had to do with VPSs.

~~~
dmpk2k
The disk problem is maddening, and I used to be a strong proponent of
dedicated hardware for this reason.

However, the provider I use (Joyent) recently added some kind of disk
scheduling that prevents these problems. I don't know how they do it, but
hopefully more cloud providers do something similar.

------
joevandyk
I'd be interested to know if they were only using EBS storage. I wouldn't use
EBS for anything latency sensitive. I've found instance-only storage to be
much faster and more consistent.

------
callmeed
It's interesting: I would (and did) choose Rackspace for dedicated hardware
over SoftLayer any day.

RS is more expensive, but the extra management and support you get is well
worth it, IMO.

------
foobarbazetc
Holy shit, someone finally gets it.

AWS is a gigantic money pit. SoftLayer is the only way to go, IMHO.

------
rythie
This would be made a lot better if you could buy a VPS with a dedicated disk
pair.

------
rayhano
Finally! A balanced explanation of the differences.

------
david_a_r_kemp
tl;dr: shared hosting sucks

------
powertower
> One last thing about getting dedicated hardware. It's cheaper a lot cheaper.
> We have machines that give us 2-4x performance that cost less than half as
> much as their cloud equivalents and we're not even co-locating.

I've been saying this for ages, and every time, people fall over backwards
trying to defend/prove their cloud mistake...

"The cloud is cheaper, faster, and infinitely scalable."

Except none of those three is true for any real-world use case, save a few.

The moment a popular site like Reddit switches to the cloud is the moment it
becomes barely usable during certain times of the day.

~~~
ChuckMcM
So I'm running operations at Blekko, and just prior to that I was at Google
in their eng/ops organization. I've done a whole lot of 'total cost of
ownership' (TCO) computations around engineering infrastructure, both for
Google and of course now for Blekko.

The conclusion I came to is that for a 'web 2.0' type setup, the break-even
point was about 500 'machines.' That is in part because a 'machine' today has
8-24 'threads', 2-40T of 'storage', and (at the time) 2-96G of 'memory.' So
in terms of 'cloud' you could easily run 10 "instances" on these sorts of
machines, meaning 500 machines might be 5,000 'instances' in an AWS-type
cloud.

It's this 10:1 multiplier effect (which is only getting better with bigger
machines), plus management techniques like running the same config
everywhere, that means your TCO grows more slowly than the capacity of the
resulting infrastructure. So you can 'solve for x' where the two lines cross
to identify the break-even point. Everything east of that point, you're
coming out ahead of a 'cloud'-based deployment.
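
The shape of that calculation, with made-up dollar figures standing in for
the real inputs (none of these are Blekko's numbers):

    # 'Solve for x' where owned-infrastructure TCO crosses cloud TCO.
    cloud_per_instance = 250    # assumed all-in $/instance-month in a cloud
    owned_fixed = 40000         # assumed fixed $/month (ops staff, colo base)
    owned_per_machine = 400     # assumed amortized $/machine-month
    per_machine_instances = 10  # the 10:1 multiplier described above
    
    # owned(m) = owned_fixed + m * owned_per_machine
    # cloud(m) = m * per_machine_instances * cloud_per_instance
    breakeven = owned_fixed / (per_machine_instances * cloud_per_instance
                               - owned_per_machine)
    print(round(breakeven, 1))  # ~19 machines with these made-up inputs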

What is still a challenge, however, is geographic diversity. If you wanted to
put 500 machines 'around the world', so 125 machines in each 90 degrees
(approximately) of longitude, the economics of getting 5-10 'cabinets' in
places around the world can work against you. (You have more negotiating
power if you're putting in 100 racks than if you are putting in 10.)

~~~
rachelbythebay
Beware the "same config everywhere" approach. It works up to a point, then it
turns into a disaster. All you need is one totally broken change like "chmod
-x /usr" to really make life interesting. You start bleeding machines and
pretty soon you have nowhere left to host your tasks.

It's interesting, right? At first, you can handle a couple of totally mixed-up
machines. Then it stops scaling and you have to start doing the whole "golden
+ syncer" approach.

Then you go too far and get into a monoculture. When the machines _do_ break,
it's impossible for humans to go around and fix them in any reasonable amount
of time because there are too many. It's amusing when this happens and the
solution put forth is "more administrative controls".

~~~
asharp
You just need to do rolling deployments to make simple things like that go
away.

Roll out a deployment to N machines (say, 10), run self-checks (you have
those, right?), and if everything passes, give them standard load. Over some
period of time K, periodically check up on them. After that, roll out to 2N
(or N^2) nodes, and continue until you have rolled out to your entire
cluster.
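
A sketch of that loop; deploy, self_check, and soak are stand-ins for
whatever real tooling drives the rollout:

    # Exponential rollout: push to N hosts, verify, soak, then double the wave.
    def rolling_deploy(hosts, deploy, self_check, soak, start_n=10):
        done, batch = 0, start_n
        while done < len(hosts):
            wave = hosts[done:done + batch]
            for host in wave:
                deploy(host)
                if not self_check(host):
                    raise RuntimeError('rollout halted at %s' % host)
            soak(wave)         # give the wave standard load and watch it
            done += len(wave)
            batch *= 2         # N, 2N, 4N, ... until the whole cluster is done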

~~~
rachelbythebay
Some overly-confident human went "oh, this can't be bad" and force-approved
it. This really happened. Multiple times. And they have rolling deployments.

Their response? More administrative measures.

~~~
asharp
Sigh. Some things never change.

