
How and Why Swiftype Moved from EC2 to Real Hardware - quicksilver03
http://highscalability.com/blog/2015/3/16/how-and-why-swiftype-moved-from-ec2-to-real-hardware.html
======
snoopybbt
There's not much need for a fancy article on a fancy website in order to
understand a key concept of cloud computing:

Cloud computing offers you the great and awesome advantage of being able to
instantly scale your application, replicate your data and basically just grow
according to your business volume, and all this without significant
investments, delivery time, setup time, people time, maintenance or anything
else. But it's expensive in the long run.

And this is OKAY, this is GREAT.

Once you're big enough that you know what your load is now and what it will
likely be, and you know _exactly_ what you need now and (approximately) what
you're going to need in the near future, setting up your own datacenter is
way, way more effective.

Amazon does not get free electricity, free servers and/or free people time. Of
course, you're paying that, and you're also paying Amazon's profits.

This is absolutely fine, as long as their service fits you.

But when you grow enough, put simply, your needs change. It's just that.

~~~
rodgerd
> There's not much need for a fancy article on a fancy website in order to
> understand a key concept of cloud computing:

I wish it were true, but plenty of companies are gripped by cloud fever. I've
seen quite a few going down the route of charging into the cloud not because
they've run the numbers and found it stacks up, but because they want to be in
the cloud, and Amazon have some great marketing people.

~~~
AndyNemmity
Companies often move because their own organization doesn't deliver, and the
hope is that a cloud company will do a better job of it.

If you have a great team, I firmly believe hosting yourself is far, far, far
less expensive.

If you have a terrible team, then cloud (hosting) is less expensive. Even if
it were exactly the same cost, you're gaining by not having to keep a staff to
run it, pay the costs of managing them, etc. etc.

Most places don't have great teams. Insert random corporation here: it likely
has a team that is a mess, for whatever reasons messes happen in large
companies.

In that case, the Cloud makes a ton of sense for them. They've already screwed
up their own organization in some way, and this is a large reset button on the
whole thing.

That's worth a ton in itself.

~~~
emodendroket
If you're a small org with relatively small volume self-hosting doesn't make
sense.

~~~
toong
Can you elaborate on why not?

You can get some cheap dedicated hosting (ex: [1]) for a fraction of the
price.

It's so cheap compared to AWS that you can order a few spare ones and still
come out cheaper than your one beefy AWS instance.

The only way it doesn't make sense is if you need to scale up and down very
fast.

[1] 60 euro/month: Quad-Core Haswell, 32 GB (non-ECC) RAM, 240 GB SSD @
[https://www.hetzner.de](https://www.hetzner.de)

~~~
berkay
The cost of the systems is almost irrelevant. The time of qualified people is
far more expensive. So it's not just the elasticity; you have to look at
everything from the time perspective as well. What additional knowledge will I
need? Will I need to learn load balancers, and how to make them highly
available? Will I have to learn about SSL certificates and termination? Will I
need to learn how to implement and operate a secure and highly available DNS
service? How much is your time worth? Our hosting costs are a fraction of a
single worker's salary. Whether AWS is more expensive is essentially
irrelevant. The OP argues that it was not doing the job for them, which is an
entirely different issue.

~~~
emodendroket
Don't forget the quality of the work. Frankly, if I'm administering servers,
I'm not going to do as good a job of it as someone whose whole job is that,
and I'm not going to do as good a job as Azure or AWS or whoever else either.
Platform as a service is the way to go.

------
stephen-mw
Here's my anecdotal, one-data-point experience from moving a giant EC2
environment to datacenter:

1\. Your operational overhead will increase _a lot_. Be ready to hire on a lot
of ops staff if you expect them to do anything but put out fires. And as you
grow you'll need experts, people like network engineers.

2\. Any weirdness you experienced with AWS infrastructure will be replaced
with weirdness in your own environment, except now you're on the line to
troubleshoot and fix it yourself.

3\. Operations staff will immediately start guarding the food bowl as
resources become finite. Server provisioning waits start to seem like
breadlines. Power is consolidated with Those Whom You Must Ask.

4\. Your costs will decrease, sometimes significantly.

5\. You'll have more hardware flexibility to run your app just the way you
want to (Stack Overflow's mega databases come to mind[0]).

In the end I think this type of transition is for stable companies that don't
mind or even prefer strong divisions of labor (coders who code, sysadmins who
sysadmin, testers who test), but it's not for startups or companies that hope
to move with any kind of strong velocity.

[0] [http://highscalability.com/blog/2014/7/21/stackoverflow-update-560m-pageviews-a-month-25-servers-and-i.html](http://highscalability.com/blog/2014/7/21/stackoverflow-update-560m-pageviews-a-month-25-servers-and-i.html)

~~~
chx
> Your operational overhead will increase _a lot_. Be ready to hire on a lot
> of ops staff if you expect them to do anything but put out fires. And as you
> grow you'll need experts, people like network engineers.

Why? What are you talking about? You are renting servers, not colo'ing them.
Networking them is not your problem. Your responsibility still starts from a
root prompt; there's just no VM layer between that and the physical server.

~~~
merb
And if you have more than one? Or more than one DC? Somebody needs to connect
them, or you will need a VPN. It's not that easy without clouds if you need
connected servers. We switched to AWS since connecting servers in a DC isn't
as cheap as people think.

~~~
corford
I host all my stuff with Hetzner (who are at the cheap end of the market
compared to Softlayer) and even they provide pre-configured VLANs and private
switches. For inter-DC connectivity you can just set up some redundant OpenVPN
links or pay the DC to configure a hardware tunnel between both sites.

------
bobofettfett
My 5 cents:

If you have lots and lots of money and a high margin business, do yourself a
favor and go with Amazon (much less hassle with contract management and low
level challenges).

If you need to scale month to month and are growing 50% per month, go with
Amazon.

If you are very small and can live with 10 instances, go with Amazon.

If CAPEX doesn't help you and for whatever reasons you need to spend OPEX, go
with Amazon.

If you need many (types of) machines for failover which otherwise mostly sit
idle, go with Amazon.

Otherwise it's always cheaper to buy or rent hardware. Amazon is very
expensive (TCO).

If you base your decision on hype, you're screwed.

* Amazon stands in for any cloud provider; personally I'm choosing Digital Ocean with Mesos/Docker.

* Except S3, which is a no-brainer to use.

~~~
chx
> If you need to scale month to month and are growing 50% per month,

Then rent more servers.

> If you are very small and can live with 10 instances,

Then rent a few servers.

I maintain that there are extremely few cases where a typical website needs
the cloud. To handle peaks, it is both simpler and cheaper to keep enough
capacity just idling around than to spin Amazon instances up and down. The
cloud is almost always useless hype. It can be different if you can architect
your system to use the various services Amazon provides.

~~~
bobofettfett
From my experience, renting more servers while growing 50% per month is a
challenge. A lot of things go wrong when installing a lot of servers each
month.

Also from my experience, with 10 instances the money you save with custom
servers is negligible, and contract and SLA management, multi-datacenter
setups, etc. are easier with a cloud provider than with rented servers. At
least where I've rented servers in the past.

~~~
Negitivefrags
I'm not sure where you see a difference between a VM and physical hardware
when it comes to provisioning.

Sure, the physical hardware takes 1 hour rather than 1 minute to spin up, but
the process is otherwise entirely identical.

~~~
chx
1 hour? Welcome to 2015! OVH spins your server up in two minutes.

------
mattbeckman
"With Amazon we experienced networking issues, hanging VM instances,
unpredictable performance degradation (probably due to noisy neighbors sharing
our hardware, but there was no way to know) and numerous other problems. "

Why do I get the feeling it was kind of a cop-out to just pack up and move
without finding the root cause? I've seen it plenty of times: the "best
solution" is to just find a different hosting provider.

In my experience, I've never found an issue with an application on AWS that
wasn't caused by either a misunderstanding of what was being offered (e.g. not
provisioning enough PIOPS for database volumes), or simply issues with the
application code.

~~~
toomuchtodo
> In my experience, I've never found an issue with an application on AWS that
> wasn't caused by either a misunderstanding of what was being offered (e.g.
> not provisioning enough PIOPS for database volumes), or simply issues with
> the application code.

You haven't been using Amazon long enough then.

Amazon is great for a proof of concept. No upfront costs, extremely scalable,
etc. Unfortunately, it's expensive compared to physical hardware once you get
to scale, and you may never solve underlying performance issues due to it
being a shared-tenant environment, even if you're a Netflix-sized customer.

~~~
TheMagicHorsey
But doesn't Netflix successfully use AWS now?

It's possible they get special treatment if they are big enough (nobody
else's jobs on their physical machines ... or something like that).

~~~
Thaxll
Netflix uses thousands of instances, they don't share servers.

~~~
babo
Using thousands of servers doesn't mean that you don't share physical servers.

~~~
babo
Even if you own all the VMs on a physical server, it's hard to avoid the noisy
neighbour problem. Netflix has its own instance monitoring tool - Vector - to
handle these issues:
[http://techblog.netflix.com/2015/02/a-microscope-on-microservices.html](http://techblog.netflix.com/2015/02/a-microscope-on-microservices.html)

------
gtrubetskoy
This is a no-brainer if you've ever done anything at scale. The explanation is
rather simple - hardware is always "on the premises", yours or Amazon's.
Someone needs to swap drives and motherboards, man the networking gear, run
cables, etc. Amazon doesn't really get a break on the hardware cost, because
10,000 servers do not cost less per server than 100 servers (in fact the
per-server cost goes up with volume if you need them to be identical). When it
comes to labor cost - if you have enough hardware for at least one full-time
datacenter tech, you're in the same boat as Amazon.

So you're paying Amazon to do the same work you would do otherwise - only
you're subject to their rules and procedures and Amazon being a profitable
business needs to mark their services up.

~~~
makmanalp
> The explanation is rather simple - hardware is always "on the premises",
> yours or Amazon's. Someone needs to swap drives and motherboards, man the
> networking gear, run cables, etc.

> So you're paying Amazon to do the same work you would do otherwise - only
> you're subject to their rules and procedures and Amazon being a profitable
> business needs to mark their services up.

But I thought that they were paying Softlayer to do that stuff instead of
Amazon. They're _not_ doing it themselves - and yet it's still cheaper!

~~~
babo
I would like to know the cost calculation after a year or two. With a handful
of servers it's easy to get the false impression that HW failures are rare.

~~~
kovyrin
Oh, it wasn't just a handful of servers after we finished the migration (we
migrated a bit late IMO, so we had a lot of traffic even back then). And
today, with a much larger infrastructure, with hardware clusters specifically
tailored to our customers' needs, etc., I'm pretty sure the same
infrastructure on EC2 would cost more than 2x.

(Update) Re: failures - with ~50 servers we see a hardware issue (a dead disk
in a RAID or an ECC memory failure) about once a month or so. None of those
failures has caused a single outage (RAID and ECC RAM FTW) so far.

~~~
babo
How do you monitor HW and network failures, and how do you notify SoftLayer?
Is that 1-2 hour replacement time true for each component of your server
fleet?

~~~
kovyrin
1-2 hours is their new-server provisioning time. For HW issues we use Nagios
(which checks RAID health and ECC memory health regularly), and at the moment
we just file a ticket with SL about the issue, showing them the output from
our monitoring. They react within an hour, and HW replacement is usually
performed within a few hours after that (usually limited by our ability to
quickly move our load away from a box to let them work on it).
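
For flavor, here's a minimal sketch of what a RAID health check like that can
look like: a Nagios-style plugin that parses Linux's /proc/mdstat for degraded
software-RAID arrays. This is an illustration of the technique, not Swiftype's
actual check (hardware RAID controllers would need vendor tools instead):

    #!/usr/bin/env python
    # Nagios-style check for degraded Linux software-RAID (md) arrays.
    # Exit codes follow the Nagios plugin convention:
    # 0 = OK, 2 = CRITICAL, 3 = UNKNOWN.
    import re
    import sys

    def degraded_arrays(mdstat_text):
        # In /proc/mdstat a healthy two-disk mirror shows "[UU]";
        # an underscore ("[U_]") marks a missing or failed member.
        bad, current = [], None
        for line in mdstat_text.splitlines():
            m = re.match(r'^(md\d+)\s*:', line)
            if m:
                current = m.group(1)
            elif current and re.search(r'\[U*_+U*\]', line):
                bad.append(current)
        return bad

    def main():
        try:
            with open('/proc/mdstat') as f:
                text = f.read()
        except IOError as exc:
            print('UNKNOWN: cannot read /proc/mdstat: %s' % exc)
            sys.exit(3)
        bad = degraded_arrays(text)
        if bad:
            print('CRITICAL: degraded arrays: %s' % ', '.join(bad))
            sys.exit(2)
        print('OK: all md arrays healthy')
        sys.exit(0)

    if __name__ == '__main__':
        main()

Wired into Nagios as a regular check command, a CRITICAL here is what turns
into the SoftLayer ticket described above.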

------
hashtree
A concrete example of how I saved a few hundred thousand dollars compared to
AWS by building quarter-rack colocation setups with SuperMicro servers:
[https://gist.github.com/rockymadden/5561377](https://gist.github.com/rockymadden/5561377)

With Ansible, I spend no more than an hour a week, amortized, on hardware
maintenance and administration combined. I assume nodes for any specific role
will fail, I only scale horizontally, I always have redundancy for every role,
I stay off disk as long as possible (heya, 512GB-RAM Redis cluster), etc.

~~~
danieltillett
Great resource. Are you going to keep this updated?

------
tschellenbach
From my experience, AWS is one of the most reliable hosting providers. It's
extremely easy to set up a fault-tolerant infrastructure using CloudFormation,
puppet/chef, boto and a bit of autoscaling. The only disadvantage AWS has is
cost. In many cases it's more expensive than a traditional hosting provider.
On the flip side, your engineers will be more productive.
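
As a rough illustration of how little glue that takes with 2015-era boto 2.x
(the AMI, key name, ELB name and sizes below are placeholders, not a recipe
from this comment):

    # Sketch: a launch configuration plus an autoscaling group spread
    # across two availability zones behind an existing ELB, using boto 2.x.
    import boto.ec2.autoscale
    from boto.ec2.autoscale import LaunchConfiguration, AutoScalingGroup

    conn = boto.ec2.autoscale.connect_to_region('us-east-1')

    lc = LaunchConfiguration(
        name='web-lc',
        image_id='ami-12345678',    # hypothetical AMI baked by puppet/chef
        instance_type='m3.large',
        key_name='deploy-key')
    conn.create_launch_configuration(lc)

    asg = AutoScalingGroup(
        group_name='web-asg',
        load_balancers=['web-elb'], # hypothetical pre-existing ELB
        availability_zones=['us-east-1a', 'us-east-1b'],
        launch_config=lc,
        min_size=2,                 # always keep two instances alive
        max_size=8)
    conn.create_auto_scaling_group(asg)

With health checks on the group, failed instances get replaced automatically,
which is a large part of the productivity argument.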

What I miss in this article is any detail on why they had issues with AWS.
You can't just say it wasn't reliable and not explain the details. AWS works
for all of the world's largest startups; why didn't it work for Swiftype?

------
kovyrin
Author here. Happy to answer any questions.

~~~
theg2
We've been using EC2 here for nearly 2 years, and you mention I/O problems and
instance outages 2-3 times a week. Which size instances were you running?

I ask because, other than the VM security updates, none of our instances have
these sorts of issues, and some of them have a VERY long life (not ideal, we
know). I understand the cost savings and the rest of the reasoning, but in my
experience EC2 isn't THAT unreliable.

~~~
kovyrin
Oh, I know what you're talking about. We too had some instances (actually, a
lot of them) that would run for a year with no issues. The problems started
around the time you tried to push EC2 instances beyond an "idle, handling some
requests just to keep from falling asleep" state. Pushing IO (even with
provisioned IOPS) caused random IO stalls, pushing CPU caused REALLY uneven
performance, etc.

And the only solution EC2 support ever offered was to buy more instances to
keep them cold and happy. The problems with that approach (just to name a
few): the cost (for a young startup, burning money on idle infrastructure like
that is not very wise IMO) and the fact that the time to design, develop and
deploy a scale-out approach for each of your backend services is time you
could have spent building your product (again, startup-specific; you'll have
to think about across-the-board scalability at some point).

~~~
micahb37
First: I work in Startup BD at AWS (disclosure), but I have been a multi-time
founder as well. I was under the impression that an AWS architect will sit
with you to optimize your infrastructure (Business Support). Did that not
happen, or was it not useful? Happy to help in any way I can.

------
icedchai
I'm convinced hybrid cloud is the way to go. Anything needing high IO
performance should be on dedicated hardware. Anything needing CPU/memory
elasticity (worker nodes, etc.) should be in the cloud. Assuming you can get
low-latency connectivity into AWS with Direct Connect, this might work?

~~~
fleitz
CPU also generally sucks due to cache thrashing on non-trivial datasets.

~~~
icedchai
Heh. So what is a good use case for EC2?

~~~
fleitz
When you have highly variable load.

eg. Netflix probably spins up thousands of servers for a few hours.

------
rynop
What specific piece saw the biggest boost? My guess is MongoDB. Also, if new
servers take 1-2 hours, you are always paying for what you "think" will be
capacity for peak load, correct? How do you handle events that quickly and
drastically increase load or txn/sec?

~~~
kovyrin
Actually, it depends. Stability- and performance-wise, I'd say our
Lucene-based search layer has seen the most impressive jump. But yeah, the
Mongo instances loved the new fast IO as well :-)

------
amazon_not
The biggest problem with AWS is the outrageous cost of bandwidth. Even if you
ignore all the other cost differentials, the bandwidth charges will kill you
at scale.

Unfortunately cloud computing, or at the very least AWS, overpromises and
underdelivers at scale. All in all, the economically viable use cases for
cloud computing at scale are very few and very specific.
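
Some rough, back-of-the-envelope arithmetic to make the point (the prices are
simplified 2015-era figures from memory, so treat them as assumptions rather
than quotes):

    # AWS egress at roughly $0.09/GB (simplified -- real pricing is tiered)
    # vs. the flat-rate gigabit ports commonly bundled with dedicated boxes.
    egress_tb_per_month = 100
    aws_per_gb = 0.09                               # assumed list price

    aws_cost = egress_tb_per_month * 1000 * aws_per_gb
    print('AWS egress: ~$%.0f / month' % aws_cost)  # ~$9,000

    # A 1 Gb/s port running flat-out moves ~324 TB/month:
    tb_per_month = 1e9 / 8 * 30 * 86400 / 1e12
    print('1 Gb/s port capacity: ~%.0f TB / month' % tb_per_month)

At 100 TB/month you'd be paying thousands of dollars for traffic that a single
unmetered port could carry with room to spare.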

~~~
wmf
Since people are sharing their experiences in this thread, has anyone tried
using Direct Connect to get cheaper bandwidth?

~~~
amazon_not
There aren't a lot of savings to be had. In addition to paying for the Direct
Connect infrastructure, you still have to pay Amazon per-GB charges, albeit at
a slightly lower rate.

The only real solution is to move the bandwidth usage off AWS.

------
nanoGeek
One question: if real hardware were always 50% cheaper, then AWS wouldn't have
been such a success. Can you please explain in which scenarios it makes
economic sense to use AWS compared to real hardware?

~~~
jarjoura
AWS is always going to include a premium, because they take care of the DevOps
portion of your infrastructure. There are plenty of virtual hosting companies
that cost significantly less than dedicated hardware, if you don't need all
the bells and whistles.

~~~
moe
_because they take care of the DevOps portion of your infrastructure_

Sorry, but that is mostly a lie.

Running a non-trivial app on EC2 is significantly _more_ complex than doing
the same on (rented) bare metal. Scaling to a _massive_ size can be easier on
EC2, but only after you've paid a significant upfront cost in terms of dollars
and development complexity.

Is your app prepared to deal with spontaneous instance hangs, (drastic)
temporary instance slowdowns, sudden instance or network failures?

Did you know that ELBs can only scale up by a certain, sparsely documented
amount per hour?

Or that you need a process to deal with "zombie" instances that get stuck
while being added to or removed from ELBs (e.g. the health check never
succeeds)? (A detection sketch follows below.)

Or that the average uptime (between forced reboots) for EC2 instances is
measured in months, while for physical servers it's measured in years?

Or that autoscaling groups with spot instances can run out of instances even
if your bid is higher than the current price in all but one of the
availability zones they span?

The list of counter-intuitive gotchas grows very long very quickly once you
move an EC2 app to production.
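
On the zombie-instance gotcha above, a hedged boto 2.x sketch of how one might
flag instances an ELB reports as unhealthy (the load balancer name is a
placeholder; a production version would also track how long an instance has
been out of service before acting on it):

    # Sketch: list "zombie" instances that an ELB reports as OutOfService,
    # e.g. ones whose health check never started succeeding after being
    # registered. Uses the classic boto 2.x ELB API.
    import boto.ec2.elb

    conn = boto.ec2.elb.connect_to_region('us-east-1')
    lb = conn.get_all_load_balancers(load_balancer_names=['web-elb'])[0]

    zombies = [state.instance_id
               for state in lb.get_instance_health()
               if state.state == 'OutOfService']
    print('Unhealthy in web-elb: %s' % ', '.join(zombies))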

~~~
kovyrin
This comment is pure gold! That's exactly what I wanted to explain here and
you did it so well. Thanks!

------
jonpress
I think cloud providers are mostly good for small/medium, fast-growing
startups. Big companies need more granular flexibility. The path to vertical
integration in the software industry is a lot smoother than in many other
industries; there are many levels:

You can...

1\. Make your own hardware

2\. Own/manage your own hardware

3\. Rent commodity hardware from a standard hosting provider

4\. Use IaaS (e.g. EC2)

5\. Use PaaS (e.g. AppFog, Nodejitsu)

6\. Use BaaS (e.g. Firebase, PubNub, Pusher.com)

The higher the level, the more technical flexibility you lose. The bigger the
company, the more it makes sense to operate at a lower level, because no
significant wastage is introduced as you move down the levels (you remove the
middlemen so you can pocket their profits) and the capital cost to move
between levels is relatively low.

Compare software to another industry like cheesemaking: if you're a
cheesemaker and you want to make your own milk, the next step is to buy the
whole farm, and then you have to figure out what to do with the meat
(wastage). Going between those two levels is expensive and could mean doubling
or tripling your expenses, so it's not an easy move to make.

------
jamiesonbecker
Cost-wise, AWS makes a lot of sense when growth is not easy to project,
happens rapidly, or varies (seasonally, randomly, etc).

Of course, some of this has to do with your team's skill level, but I've had
clients run up $100k+ monthly bills at AWS with a relatively small build-out.
(and, wow, VPC migrations..)

For fixed or predictable growth patterns on a mature app/platform, a slow
build-out on real iron will generally be _significantly_ less expensive, all
other things being equal.

However, there are other advantages to AWS that get lost in this story, such
as pre-built, highly scalable datastores. Comparing EC2 to real iron misses
most of the real story on why the cloud is changing everything.

One of the hardest things I have to tell clients is not to build their own
datastore/database in-house or on EC2; sometimes the case is clear-cut, and
sometimes not so much, but if there's a datastore at AWS that gives you 80% of
what you need, use it instead of rolling your own. (source: IAMA AWS Growth
Architect)

------
davidjgraph
It strikes me that, more and more, a critical choice when growing in this way
is the DNS part. It needs to be back-end agnostic and provide an increasing
amount of functionality.

Health checks and failover are must-haves now, but this article makes me
wonder three things:

1) Are there any DNS services that understand the geography of your "zones",
i.e. route and fail over based on IP (but are still platform agnostic)?

2) How long can a DNS failover take, worst case? You can technically set a
low TTL, but don't a lot of ISPs just raise it to a minimum?

3) Isn't it better to replace some of the DNS failover with highly available
dedicated load balancing?

~~~
hhw
1) Yes, there are several DNS service providers that offer BGP anycast with
geographically aware failover / load balancing. UltraDNS and Dyn are the
larger ones.

2) Yes, some ISPs do set a minimum TTL. Although BGP anycast is the most
effective first line, sometimes it makes sense to have your reverse-proxy
caching layer override that distribution based on GeoIP and redirect to a more
suitable proxy node closer to the client. This is especially the case when
people use recursive DNS servers that aren't necessarily geographically close
to them (e.g. 8.8.8.8). It can also be useful in cases where TTL expiration
hasn't caught up yet.

3) No. Think of BGP anycast DNS as distribution at the global level, and
dedicated load balancers as distribution at the local level. You need to work
out how to get the traffic to the load balancer first, and load balancing
across distant geographies (high latency) results in horrible performance.
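
On the minimum-TTL question in (2), one way to sanity-check a given resolver
is to query it directly and compare the TTL it reports against the TTL
configured on your zone: a resolver that enforces a minimum will hand back the
clamped, larger value. A small sketch using dnspython 1.x (`query` was renamed
`resolve` in 2.x); the domain and resolver IP are stand-ins:

    # Query a specific recursive resolver twice. The first answer shows the
    # TTL the resolver chose to serve (your zone's TTL, or its own floor if
    # it clamps); the second shows the cached entry counting down.
    import time
    import dns.resolver

    resolver = dns.resolver.Resolver()
    resolver.nameservers = ['8.8.8.8']   # the resolver under test

    first = resolver.query('example.com', 'A')
    time.sleep(5)
    second = resolver.query('example.com', 'A')

    print('TTL now: %d, TTL 5s later: %d'
          % (first.rrset.ttl, second.rrset.ttl))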

------
peterwwillis
Cases where people try to use EC2:

\- We're starting from scratch and think AWS will give us flexibility for
cheap

\- We have existing servers and think moving from them to AWS will give us
flexibility for cheap

Cases where people move away from EC2:

\- It was slow/unreliable

\- It was expensive

Conclusion: you should use AWS EC2 in order to save money and have flexible
resource allocation, but don't expect it to be stable or cost-effective.

------
billyhoffman
I used Rackspace (and later EC2) originally because of upfront capital costs.
I bootstrapped my company, and paying a thousand dollars or more, plus
hosting/racking fees, just wasn't doable. Fifty to a hundred bucks for a
reasonable Windows server? Easy deal to make when cash flow is small.

------
anonbanker
It's about a 5-7 year cycle: IT services are outsourced (this time, to the
cloud), but as time progresses, all of those services are consolidated back
in-house (usually under new management).

I used to be able to set a watch by HP's cycle before the EDS buyout.

------
qaq
Finally, a sane take on the cloud BS :)

------
forgottenacc56
I'm telling you, there's a real opportunity for someone to create "the Uber
of colocation/dedicated hosting".

~~~
superuser2
OpenStack was supposed to facilitate this. It doesn't seem to get much use.

~~~
AndyNemmity
It does from my perspective. I could list many companies that use it for
major things.

It doesn't always get publicity. I work for a major company, and we don't
scream from the rooftops about it.

------
johnnycarcin
Interesting article, thanks for sharing. We actually just went in the exact
opposite direction because of the larger-scale issues we were having with
Softlayer. Do you feel like you lost any resiliency by making the switch to
physical servers (more virtual instances on one physical server, servers in
the same rack, etc.)?

~~~
kovyrin
No, I really do not think going to EC2 could be beneficial in any way in
terms of resiliency compared to Softlayer. SL allows you to control which
VLANs your box will end up on. VLANs can be treated as racks (since they do
not allocate more than one VLAN per rack). Then you have multiple DCs in one
region (e.g. DAL01, DAL05, DAL07, etc.) and many different regions (DAL, SEA,
WAS, AMS, etc.).
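
To illustrate the VLAN-as-rack idea (a toy sketch, not Swiftype's actual
tooling; the host and VLAN names are made up): treat each VLAN as a failure
domain and require each replica of a shard to land on a different one:

    # Toy placement helper: pick n replica hosts, each from a distinct VLAN,
    # treating SoftLayer VLANs as rack-level failure domains.
    from collections import defaultdict

    hosts = [
        ('web1', 'vlan-101'), ('web2', 'vlan-101'),
        ('web3', 'vlan-202'), ('web4', 'vlan-303'),
    ]

    def pick_replicas(hosts, n):
        by_vlan = defaultdict(list)
        for host, vlan in hosts:
            by_vlan[vlan].append(host)
        if len(by_vlan) < n:
            raise ValueError('only %d failure domains' % len(by_vlan))
        # take one host from each of n distinct VLANs
        return [members[0] for vlan, members in sorted(by_vlan.items())[:n]]

    print(pick_replicas(hosts, 3))  # ['web1', 'web3', 'web4']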

I'd be very interested in what problems you were having with them and at what
scale. If this is a private topic, we could do it over email or some other
medium if you like. You can contact me by any of the means listed here:
[http://kovyrin.net/contact/](http://kovyrin.net/contact/)

~~~
johnnycarcin
We were about 75% virtual with SL and 25% bare metal. One of our issues with
the virtual stuff was that when we started dedicating instances to a set VLAN,
we repeatedly hit a situation where some resource for the pod that VLAN lived
in would be maxed out (usually storage), so we couldn't create a new instance.
The solution we were given was to let the system pick a VLAN, but by doing
that we lost control of placement and added complexity to our architecture.

Aside from that it was mainly nit-picky stuff, but still things that were
annoying (networking issues between DCs, networking issues between pods,
internal mirrored apt repos going out of sync, an API that is kind of blah,
etc.).

We use Docker, so having a few bare-metal machines with tons of containers on
them wasn't a great HA setup (for us at least), even running in two data
centers. The fairly quick setup time was a nice selling point, though.

When we went to AWS, things just kind of worked. The API was easier to use
and the GUI portal was way nicer and more stable. So far we have not had any
odd issues with our instances, but we also typically run them at about 50%
capacity, so that might be why. It is also still early, so maybe things will
come up in 6+ months that send us back to SL :)

------
deitcher
Oleksiy, can you share the economics? You said 50% savings. What was
everything you had running in EC2 (easy to figure out the costs), and what
were your equivalents in Softlayer? It would be very interesting to see the
economics.

------
fsniper
Oh bare metal reloaded :)

------
shimon_e
RunAbove from OVH seems to be the best of both worlds. You can rent dedicated
hardware by the hour for very reasonable prices.

~~~
kovyrin
AFAIK Softlayer has hourly-based rental for real hardware as well. Never tried
that, so could not comment on it though.

------
anthony_barker
In my experience, firewalls and networking are more expensive than
servers....

Does anyone offer competition to VPC?

~~~
kovyrin
Softlayer manages the network for us. They offer a fully isolated backend
network (a frontend connection to the world is optional). For frontend
connections we simply use iptables.

------
abalone
"2-3 serious outages a week" with EC2?? Details, please.

~~~
kovyrin
Those weren't outages for a specific instance. But from the whole pool of
instances we were running 2-3 would have networking issues, random unexplained
hangs requiring an instance restart, huge CPU performance drops, IO hangs,
etc, etc.

------
BackOel
Thanks for the information

