
How We Saved $132k a Year With an IT Infrastructure Audit - joshsharp
https://overflow.buffer.com/2016/03/31/how-we-saved-132k-a-year-by-spring-cleaning-our-back-end/
======
buro9
I will add another suggestion: if you use S3 at all, one of your largest
costs is likely the bandwidth. Have you considered just placing caches in
front?

I took a $100 per month S3 bill down to $5 per month by simply having existing
Nginx servers enable a file cache.

It does help that I never need to purge the cache (versions are saved in S3 and
become URLs), but it was super trivial to just wipe out $95 per month of cost
for zero extra spend.

My current setup is:

S3 contains user photos; the web app (not on AWS) handles POST/GET for S3 (and
stores local knowledge); Nginx at my edge has a cache that is currently
around 28GB of files; and then CloudFlare in front of all of this saves me
around 2TB of bandwidth per month.

The real gotcha for me is that I was relying on CDNs for my cache, but once
CDNs reach 50+ PoPs I started to see multiple requests for the same thing
as a result of people in different cities requesting a file. So the Nginx
cache I've added mostly deals with this scenario and prevents additional S3
cost from being incurred.
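For anyone curious what "enable a file cache" looks like in practice, here is a minimal sketch of an nginx proxy cache in front of S3 (hostnames, paths, and the bucket name are placeholders, and the directive values are illustrative, not the actual config described above):

```nginx
# Disk-backed cache; max_size mirrors the ~28GB figure mentioned above.
proxy_cache_path /var/cache/nginx/s3 levels=1:2 keys_zone=s3cache:100m
                 max_size=28g inactive=30d use_temp_path=off;

server {
    listen 80;
    server_name media.example.com;          # placeholder edge hostname

    location /photos/ {
        proxy_cache s3cache;
        proxy_cache_valid 200 30d;          # versioned URLs never change, so cache long
        proxy_cache_use_stale error timeout updating;
        proxy_set_header Host example-bucket.s3.amazonaws.com;   # placeholder bucket
        proxy_pass https://example-bucket.s3.amazonaws.com/;
    }
}
```

Every edge hit served out of `/var/cache/nginx/s3` is a GET that never reaches S3, which is where the bandwidth savings come from.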

~~~
rakoo
> The real gotcha for me is that I was relying on CDNs for my cache, but when
> CDNs reach 50+ PoPs I was starting to see multiple requests for the same
> thing as a result of people in different cities requesting a file.

Having never used a CDN, this sounds weird. This means that the caches are not
synchronized between PoPs, even though they're supposed to be at the same
level in an HTTP request. Is this normal behaviour for a CDN? I'd expect one
PoP to check with other PoPs before hitting upstream.

~~~
buro9
It's normal for most smaller CDNs to not have their PoPs communicate, yes.

With larger CDNs you start to get hierarchical caches:
[https://trafficserver.readthedocs.org/en/5.3.x/admin/hierach...](https://trafficserver.readthedocs.org/en/5.3.x/admin/hierachical-caching.en.html)

The theory being that the PoP closest to the origin is the one responsible for
going to the origin, and thus other PoPs will fetch cached items from the PoP
closest to the origin.
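That arrangement is easy to picture with a toy model (illustrative Python only, nothing like a real CDN implementation): edge PoPs fall back to a parent cache near the origin, and only the parent tier ever talks to the origin.

```python
class Origin:
    """Stands in for S3 or another expensive backend; counts how often it is hit."""
    def __init__(self):
        self.hits = 0

    def fetch(self, key):
        self.hits += 1
        return f"content-of-{key}"


class CacheNode:
    """A PoP cache that falls back to a parent (another cache or the origin)."""
    def __init__(self, parent):
        self.parent = parent
        self.store = {}

    def get(self, key):
        if key not in self.store:
            self.store[key] = self.parent.fetch(key)
        return self.store[key]

    fetch = get  # so a CacheNode can itself act as a parent


origin = Origin()
parent = CacheNode(origin)                     # tier closest to the origin
pops = [CacheNode(parent) for _ in range(50)]  # 50 edge PoPs

for pop in pops:                               # users in 50 cities request the same file
    pop.get("photo.jpg")

print(origin.hits)  # 1: only the parent tier ever touched the origin
```

Without the parent tier, the same 50 requests would have produced 50 origin fetches, which is exactly the "multiple requests for the same thing" problem described above.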

Nearly all of the very large CDNs support some degree of hierarchical caching,
and the ones becoming large are gaining the capability.

At CloudFlare (where I work) the need for a hierarchical cache was low
priority until we started to rapidly increase our global capacity... once you
reach a certain scale then the need for some way to have PoPs not visit the
origin more than once for an item becomes very important. You can be sure
we're working on that (if you are an Enterprise customer you could contact
sales to enquire about beta testing for it).

But right now, for us and many other providers, just enabling an nginx cache
in front of any expensive resource will help. By "expensive", I generally mean
anything that will trigger extra expenditure or processing when it could be
cached.

Edit: Additionally, nearly every CDN operates an LRU cache. You haven't
bought storage, and not everything you've ever stored is being held in cache.
You only need a few bots (GoogleBot, Yandex, Baidu, etc.) constantly spidering
to be pulling the long tail of files from your backend S3 if you haven't got
your own cache in front of it. Hierarchical caching isn't a silver bullet that
takes care of all potential costs incurred, but having your own cache is.
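The long-tail problem is easy to demonstrate with a toy LRU cache (illustrative only; real CDN eviction logic is far more involved):

```python
from collections import OrderedDict


class LRUCache:
    """Toy LRU cache: the least-recently-used entry is evicted first."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()

    def get(self, key):
        if key in self.entries:
            self.entries.move_to_end(key)  # mark as recently used
            return self.entries[key]
        return None  # cache miss -> would hit origin (e.g. S3)

    def put(self, key, value):
        self.entries[key] = value
        self.entries.move_to_end(key)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict LRU entry


cache = LRUCache(capacity=2)
cache.put("hot.jpg", b"...")
# A bot spidering the long tail evicts the hot file:
cache.put("old-1.jpg", b"...")
cache.put("old-2.jpg", b"...")
print(cache.get("hot.jpg"))  # None: the next real visitor goes back to origin
```

With your own cache in front of S3, that re-fetch hits your nginx box instead of incurring S3 request and bandwidth charges.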

~~~
aembleton
What size is the Cloudflare cache on the Free and Pro plans?

~~~
buro9
It's a question without an exact answer.

It depends where your origin is, where your users are (near 1 PoP? 2 PoPs?
spread evenly globally?), how frequently files that can be cached are
requested, the capacity of each PoP, the contention of each PoP, etc.

A few years ago when the customer and traffic growth rate exceeded the network
expansion rate the answer was probably "not big enough", but we've since
upgraded almost every PoP and added a huge number of new ones:
[https://www.cloudflare.com/network-map/](https://www.cloudflare.com/network-map/)

The answer now is "more than big enough".

We cache as much as possible, for as long as possible. The more requested a
file, the more likely it is to be in the cache even if you're on the Free
plan. Lots of logic is applied to this, more than could fit in this reply.

But importantly, there's no difference in how much you can cache between the
plans. Wherever possible, we give the Free plan as much capability as the
other plans.

------
gedrap
A few people in the comments are saying "isn't it better to just set up your
own SQL server instead of RDS?" and similar. I don't want to post a reply to
each, so I will say it here.

While I can totally sympathize from a programmer's point of view (setting up
and tweaking stuff is great fun), you need to ask yourself whether it is in
the interests of the business to do so. Especially if you're working in a
small team with no dedicated infrastructure staff, or at a startup with a
short runway and a lot of urgent user-facing changes.

Doing something on your own (e.g. setting up your own alternative to S3, or
configuring your own SQL servers) comes with a cost, and it's not only the
programming/initial setup time. It's also opportunity cost (instead of setting
up a server, I could, for example, analyze some user data); maintenance (more
things to worry about which you could otherwise outsource); the skill set
required to run the infrastructure (running your own SQL cluster requires more
knowledge and training than running one on RDS); etc.

So is it in the interest of the business to run your own infrastructure?

If you have thousands of servers and are spending millions on them, then
probably yes, but then you can probably negotiate an attractive deal with GCE
or AWS :)

If your application needs some complex performance related stuff which is
harder to do in the cloud (e.g. some custom hardware or whatever), then again,
running your own infrastructure might be better.

But if you are like the majority of companies/products (you just need
infrastructure to run reliably, and performance should be just good enough),
using AWS and friends might make a big difference.

~~~
latch
If you come to the conclusion that the opportunity cost is too high for your
team/company, fine... I can believe it... as long as you're also weighing the
benefit of learning. All learning has an opportunity cost.

I do believe that someone who can set up and configure nginx to do load
balancing, caching, and rate limiting, and who can build middleware in Lua, is
potentially going to be a more productive full-stack developer than someone
who can't. Having managed PostgreSQL has made me a more effective programmer.
I have a good understanding of how it vacuums and collects statistics, and of
the relationship between connections, sorts, and work_mem, so I'm better
equipped to write queries and troubleshoot issues.

The gain to me personally and to my employer (and future employers) is not
trivial.

~~~
rashkov
This reminds me of recent discussions about how countries that outsource their
manufacturing quickly lose knowledge of manufacturing technology and fall
behind in innovation and self-reliance. I don't think we are there yet, but it
is conceivable that in the future system administration could become a lost
art to many. Something to consider as more of our infrastructure needs are met
by the cloud. I'm sure I'm overstating this, but I figured I'd share anyway

------
latch
If saving money and giving your users a better experience are priorities then,
_in general_ , moving off AWS is worth considering. When it comes to EC2, use
spot instances, or maybe you're doing it wrong.

Conservatively speaking, and without counting bandwidth, you're looking at EC2
costing 2-4x more than dedicated while being 2-4x slower. This does depend on
the specific workload, and the gap _has_ been closing (conversely, I've seen
specific workloads that were worse than 4x slower).

I know RDS is convenient. But learning how to set up and manage your own
database is actually a fundamental skill that will serve you well. All
learning can be seen as an opportunity cost, but this is one that will save
you money every month and give your users a better experience.

~~~
throwaway2016a
Do you have any source you can point to for that?

Anecdotally, I have found that when we account for...

\- Human resources costs (payroll, taxes, benefits) [1]

\- Time to market for new features / products

\- Utilizing reserved instances where it makes sense

\- Appropriately sizing machines

We get with AWS...

\- Faster time to market (accelerated revenue)

\- Relatively the same cost per month (cheaper in some areas, more expensive
in others)

\- Significantly lower initial investment

\- Increased redundancy (via many small servers vs. a few large servers) /
decreased disaster recovery times (and in many cases automated recovery)

I'm not saying you're wrong, that's just what I've seen when we run the
numbers internally. You may have seen differently which is why I ask.

[1] A devops person to manage servers, dedicated or not, can cost $80-$125k+ a
year after benefits and taxes. That is a lot of AWS instances. And we have
found we need an IT staff about half the size to manage AWS vs. a dedicated
data center.

~~~
karterk
Being on AWS does not necessarily mean that you don't need a devops person -
especially at the scale where not being on AWS actually makes a difference to
your margins.

I have seen quite a few people move off AWS successfully onto bare-metal
leased hardware. S3 is just about the only service that's difficult to find an
alternative for. Personally, I find using something like DynamoDB no different
from using an Oracle DB - it's vendor lock-in. Unless you have enterprise-
level support on AWS (which costs a lot), if you run into issues with
Amazon's proprietary services, then good luck to you.

AWS is great to get started, but once you know that you're going to need scale
(and lots of infra), it's best to move.

I say all of this as someone who has extensive operating experience on AWS.
YMMV.

~~~
UK-AL
Oh, with cloud services you definitely need devops.

Without AWS you also need dedicated infrastructure, networking, and hardware
people as well as devops.

You need people who know how to configure Cisco networking gear, who
understand SANs, iSCSI, Fibre Channel, racks, blades: lots of stuff even
devops people don't think about.

~~~
tristor
What exactly is "DevOps" to you? I ask, because almost everyone has a
different answer. I've been doing "DevOps" for more than ten years, and all of
the items you listed as required skill-sets are within my capabilities and
have been used at many of the places I've worked. I'd be hard pressed to call
someone an ops person if they don't understand the basics of server hardware,
networking, and storage. These are essential components which the system
relies on, the same system you are responsible for the uptime of.

I often hear things like your statements and can't help but wonder if the
general quality of ops people is so bad in our industry and I just haven't
encountered it, or if the reason ops people are treated so poorly in most
organizations is just that developers automatically assume we don't know
anything rather than asking.

~~~
UK-AL
DevOps for me would be things like Puppet, networking (subnets, load
balancing, firewalls, etc.), deployments, CloudFormation templates, and ARM
templates, rather than directly setting up hardware.

Whereas a dedicated networking person would know the specifics of a certain
vendor. You can make a career out of just knowing how to set up Cisco hardware
and Cisco's embedded OS. DevOps people tend to be broader than that.

~~~
tristor
All of the items you listed fall under my definition of "DevOps" as well. I
loosely define it in two ways:

1) "DevOps is a philosophy, not a title" (this is mostly because of managers
thinking otherwise)

2) "DevOps is about focusing on automation of systems infrastructure to
improve reliability, flexibility, and security."

Regarding #2, though, since my past experience includes building public
clouds, my perspective does not limit "DevOps" to only utilizing public
clouds. You can automate the build-out of physical hardware too. It's not
really possible to automate rack-and-stack, but you can abstract that away
through external logistics vendors that pre-rack/cable gear for you at a
certain scalability point.

Things like OpenStack Ironic, Dell Crowbar, Cobbler, Foreman, etc. are
definitely DevOps tools, yet they are specifically focused on handling
automation of physical hardware deployments.

As a further example, many networking vendors now provide APIs, but even when
they didn't they had SSH interfaces. It was very possible to automate the
deployment of large quantities of networking gear using remote-execution tools
like Ansible, or even just Ruby or Bash scripts. There's not necessarily a need to
have a dedicated networking person.
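As a rough sketch of that pre-API approach (the hostnames and commands here are hypothetical, and a real deployment would use an inventory tool), you can pipe a command script to each device over SSH:

```python
import subprocess

# Hypothetical switch inventory and config script; in practice these would
# come from an inventory file or a tool like Ansible.
switches = ["sw-rack1.example.net", "sw-rack2.example.net"]
commands = [
    "configure terminal",
    "vlan 42",
    "name app-servers",
    "end",
    "write memory",
]

def push_config(host, cmds):
    """Pipe a command script to a device over SSH, as many pre-API tools did."""
    script = "\n".join(cmds) + "\n"
    return subprocess.run(
        ["ssh", "-o", "BatchMode=yes", host],
        input=script, capture_output=True, text=True,
    )

# Usage (would actually connect to each device):
#   for host in switches:
#       result = push_config(host, commands)
#       print(host, "ok" if result.returncode == 0 else result.stderr.strip())
```

The point is that the same loop scales to hundreds of devices, which is what makes a dedicated per-vendor specialist optional until the network itself gets complex.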

Of course, as you scale up to a certain point in physical gear, it pays to
have specialization. But that's true even in the cloud, where you may need to
hire a specialist to deal with your databases, a specialist to deal with
complexities of geographical scale/distributed systems, a specialist to deal
with complex cloud networking (VPC et al). Just because something is
abstracted away into a virtual space doesn't necessarily reduce its complexity
or the base skillsets required to operate that infrastructure.

------
joshsharp
One thing I found surprising on reading this is that they essentially spent
$10k and two months ($5k listed as saved * 2 months) to figure out they
weren't using their logging infrastructure any more. Wish I could be that
gung-ho with resources!

------
tssva
The title should really be "How we stopped wasting $132k a Year With an IT
Infrastructure Audit." The article certainly leaves the impression that a lack
of processes, procedures, and basic change management left you in a position
where you were wasting $132k a year. As I read it, the article shows some
recognition of that fact, but a false dichotomy between innovation and time to
market on one side and change control on the other is used to justify not
properly addressing the issue. Thus it will most likely strike again, but in a
more painful manner.

~~~
T2_t2
[https://buffer.baremetrics.com/](https://buffer.baremetrics.com/)
Specifically
[https://buffer.baremetrics.com/stats/mrr#start_date=2015-10-...](https://buffer.baremetrics.com/stats/mrr#start_date=2015-10-01&end_date=2016-03-31)

Buffer has grown from $509K a month 12 months ago, to $663K a month 6 months
ago, to $782K a month now. If being able to iterate quickly helped at all,
that $132K is more than covered by the MRR growth from 12 months ago to today.

~~~
latchkey
You're justifying wasting money because the company is making money? While I
think the tone of the comment above is pretty inflammatory, the person is
totally right. It was the first thing I thought of when I read the title.

If you're spending $132k/mo on hosting to make only $782k/mo, you're doing it
wrong.

~~~
sokoloff
Where did the $132K/mo figure come from? The only place I see it quoted is as
an annual savings number.

~~~
latchkey
You're right... they spend even more than $132k/month. They just trimmed the
total amount. So they are even worse off. =(

~~~
sokoloff
Is there evidence for that in the article, or supposition?

------
daigoba66
I wonder where the tipping point is at which it's more economical to own/lease
your infrastructure instead of using AWS/Azure/etc.

The company is certainly large enough to have their own infrastructure team.

But granted, migrating an entire system that makes extensive use of the AWS
ecosystem is anything but trivial.

~~~
latch
In terms of scale, for EC2 and all services that run on it, it's more
economical at ALL points to rent (dedicated) or buy (colocate). S3 is more
competitive, so that question becomes more interesting.

There might be specific use-cases/apps/business with high burstiness where the
general case isn't true. There might also be teams that can't manage it.

~~~
brianwawok
Careful about ALL. What about something that fits in the free tier? Or the
$1-a-month Lambda tier?

~~~
AznHisoka
That's a minor exception not worth adding a footnote for.

~~~
brianwawok
So it's cool to just blindly say ALL when you mean "many"?

~~~
latch
I could cop out and say "but if it costs you 10x more after the free tier has
expired, is it really free?"

But I can admit that I forgot about the free tier. It's a good deal and a
smart business move by Amazon. Sorry.

------
AdamN
The money saved isn't nearly as important as the reduced complexity and
exposure gained from a regular spring cleaning.

------
javajosh
Am I the only one who initially read this as "How We Saved $132k a Year With
an IT Infrastructure Adult"? Having an IT Infrastructure Adult is essential,
after all.

~~~
wmf
The first thing the adult does is order an audit, so it amounts to the same
thing.

------
ktamura
>For the longest while we’ve used fluent.d[sic] to log events in our systems.

As a maintainer, glad to see Fluentd there =) If folks have questions on
Fluentd, I'd be happy to answer them here.

------
ing33k
I'll add my personal experience regarding RDS.

If you are using RDS with provisioned IOPS, you can reduce your bill
dramatically by downgrading to General Purpose SSD. I know certain
applications might really need the dedicated IOPS, but it's better if you
monitor your read/write rates and decide accordingly.
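As a back-of-the-envelope sketch of that decision (the 3 IOPS/GB baseline and the 100 IOPS floor reflect gp2 as historically documented; treat the numbers as illustrative, not current AWS pricing or limits):

```python
def gp2_baseline_iops(volume_gb, iops_per_gb=3, floor=100):
    """Approximate gp2 baseline: ~3 IOPS per provisioned GB, with a minimum floor."""
    return max(volume_gb * iops_per_gb, floor)

def piops_needed(observed_peak_iops, volume_gb):
    """True if observed peak IOPS exceeds what General Purpose SSD would sustain."""
    return observed_peak_iops > gp2_baseline_iops(volume_gb)

# Example: a 500 GB volume whose monitoring shows ~900 peak combined read+write IOPS
volume_gb = 500
peak_iops = 900
print(piops_needed(peak_iops, volume_gb))  # False: 900 < the 1500 IOPS baseline
```

If the check comes back False for your observed peaks, you're paying for provisioned IOPS that General Purpose SSD would have covered anyway.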

------
peteretep
So they put a team of engineers on a problem and managed to save an amount
roughly equivalent to the salary of one of those engineers? Also, don't AWS
resources generally get cheaper over time?

~~~
mgkimsal
nstart already pointed out that it didn't take long to do, but... I've seen
this attitude over and over again: "expenses are cheap, engineers are
expensive, money will just keep flowing", etc.

We don't all live in worlds of unlimited budgets. For them to have taken 2
weeks (_maybe_ $5k of effort) to save $130k-plus is phenomenal (and yet also
just mundane). This means more money can be spent on hiring someone else, or
on higher profit sharing for all, vs. just continually and marginally
increasing someone else's bottom line.

As buffer grows, their needs will grow and expenses will grow. I hate to push
out predictions too far in the future, but this $5k of effort is probably
going to save them $500k over the next 3-5 years.

What's a bit irritating in all this is that the folks behind it probably won't
be rewarded correspondingly (although buffer seems a bit more open and
egalitarian about these sorts of things).

~~~
Mtinie
I'm glad you called out the cumulative savings benefit.

I'll add that they also gain a benefit multiplier in the form of the knowledge
they took away from the exercise, something that will hopefully carry through
into their future infrastructure decision-making.

~~~
mgkimsal
I was going to mention that too - the benefit to future projects, budgets and
employers is potentially quite large.

------
pm
The amount saved affords an extra C-level hire. What intrigues me, though, is
what required them to jump from 25 to 80 people.

------
nasalgoat
One thing that really stands out is that they aren't running reserved
instances. That must be incredibly expensive.

------
kinther
Any way this could be done with Microsoft Azure as well?

~~~
windowsworkstoo
Well, obviously. The key point is to review what you have, get rid of what you
don't need and negotiate better pricing for what you do need. If you aren't
doing this at least annually anyway, you are leaving money on the table. The
platform doesn't matter.

------
willholloway
I was able to save a small, niche web hosting company with its own
proprietary CRM around $112k annually. I also greatly improved security in
their shared hosting setup and automated code deployments and new customer
onboarding.

I was able to do this by using the Edgecast CDN as a caching proxy for all
anonymous traffic. This greatly reduced the load on their servers, and we were
able to drastically cut the number of servers required. Rackspace servers are
incredibly expensive, so this represented a big savings.

We could have cached the pages in other ways, but this had the added benefit
of serving anonymous requests from edge nodes, which reduced page load time
considerably.

It was a big migration with sometimes maddening constraints imposed by
business necessities and technical debt, but in the end we were able to
eliminate a good bit of that debt.

The most frustrating part of the process was having to deal with sales reps
who kept trying to push "cloud" solutions as a panacea for all scaling
challenges.

The bandwidth and storage costs from using something like S3 would have been
atrocious. The rackspace "cloud" solutions all would have had unacceptable
latency problems.

And it would have required impossible code rewrites. One incredibly
frustrating requirement of the project was that we could not force updates of
the PHP CRM onto any particular client. We offered an upgrade path,
but we had dozens of different versions of the software running on the
servers, along with Wordpress and other PHP/MySQL apps installed at customer
request.

Shared PHP web hosting is one of the most difficult environments to work in.
Each account was a petri dish of whatever customers uploaded via their docroot
FTP access.

I pushed through a lot of changes to eliminate that practice and lock users
down to FTP access only for directories that would not execute PHP.

I also had the company move all Wordpress installs to Flywheel, offloading all
the maintenance and security implications of Wordpress to a company that
focused on just that. This allowed the company to focus on its own CRM and
nothing else.

All of it came at a really key time, because competitive pressure from
Squarespace forced the company to drastically reduce prices.

When I pitched the original idea for the project, the internal team didn't
tell me about the security issues, the multitude of versions running, the full
FTP docroot access, or even the existence of Wordpress on the servers.

When I discovered how FUBAR the entire setup was, as a contractor it would
have been easy for me to bail, and probably personally healthier for my stress
levels (I took off three months after it was all over to relax), but I stuck
with them and brought them through the entire process to a successful
conclusion.

I'm pretty proud of that project. I pulled it all off under some of the most
difficult and irrational conditions one could imagine.

And the ROI for the company was insanely great.

