
Breaking a large AWS spend into understandable pieces - fullung
https://segment.com/blog/spotting-a-million-dollars-in-your-aws-account/
======
siliconc0w
Sounds about right, this is really a hidden 'extra cost' with AWS. Once you
get to a certain size, you really need to spend dedicated engineering
resources to build extra tooling like this or do other 'cost engineering' like
RI capacity planning - which you totally thought you didn't need to do anymore
right?

You can offload some of that to paid services like cloud-health but it still
takes engineering resources to manage the costs you find out about. For example, you
may want a 'fully immutable' data pipeline (i.e., DPL) - with each job running
on a fresh instance each time but that usually means getting hit with a full
billing hour even if the job takes <1 minute. So you have to use
schedulers/containers but then you're working outside the 'instance' paradigm
and get hit with the problems segment talks about.
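To put rough numbers on the full-billing-hour problem, here is a minimal sketch. The hourly rate, job count, and runtime are made-up assumptions, not figures from the article, and `billed_cost` is a hypothetical helper, not an AWS API.

```python
import math

def billed_cost(runtime_seconds, hourly_rate, granularity_seconds=3600):
    """Cost of one run when usage is rounded up to the billing granularity."""
    billed_units = math.ceil(runtime_seconds / granularity_seconds)
    return billed_units * granularity_seconds / 3600 * hourly_rate

# Hypothetical workload: 500 jobs a day, 45 seconds each, $0.10/hour instance.
JOBS_PER_DAY = 500
JOB_SECONDS = 45
HOURLY_RATE = 0.10

# One fresh instance per job: every job pays for a whole hour.
fresh_instance_per_job = JOBS_PER_DAY * billed_cost(JOB_SECONDS, HOURLY_RATE)

# Same work packed back to back onto one long-lived instance.
shared_instance = billed_cost(JOBS_PER_DAY * JOB_SECONDS, HOURLY_RATE)

print(f"fresh instance per job: ${fresh_instance_per_job:.2f}/day")
print(f"one shared instance:    ${shared_instance:.2f}/day")
```

The gap between the two numbers is what pushes people toward schedulers and containers, at the price of losing per-instance cost attribution.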

To control costs and provide good UX for end users we eventually started
regressing to mainframe-style computing with the ridiculous X1 RIs which was
an interesting experiment. These take like a house down-payment but on the
plus side users are really pleased when they see 128 cpus in htop.

~~~
cakeface
I'm curious what you mean by fully immutable data pipeline or DPL. Is this a
data processing methodology or an actual technology?

~~~
siliconc0w
Ah, I probably should have said 'idempotent'. Often, you want a fresh and clean
resource to pop into existence, process something, persist the result
someplace, and then go away. To align with the rest of AWS in terms of
billing, security/IAM, networking, etc you kinda need to use the 'ec2
instance' as this resource but this causes problems so usually the customer
has to create and manage another level of abstraction (like EMR, kube, mesos,
ECS, what have you).

It'd be real nice if the customer didn't have to do this but AWS would rather
spend their engineering resources to compete with the G at beagle
identification APIs or try to vertically integrate with poor clones of github.
(I realize this is _slightly_ unfair )

I know, I know - you're saying 'lambda' is the answer - won't get into that
here.

~~~
RhodesianHunter
I don't think anyone would say Lambda is the answer for anything but very
quick jobs, not until they pump the available run time up to an hour.

------
dberg
We are about to open source our tool that consumes the billing files from S3
(via SQS notification) and then loads them into redshift auto handling the
purging/updating.

Dealing with Amazon billing is absolutely infuriating. We are also on their
EDP (Enterprise Discount) program and all of it is done MANUALLY. There is
literally NO way I can check hundreds if not thousands of instance hours to
ensure there was no error in their processing. It's complete madness. We use a
dozen services, and EBS, bandwidth, and inter-AZ bandwidth are impossible to audit.

That being said we have Looker sitting on top of Redshift, we make sure each
"team" has a tag (infra won't spin up without it) and then we can easily set
budgets, watch trends in Looker and track spend by team and by product name.
Our finance team loves it.
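For anyone without Redshift and Looker, the tag rollup itself is simple. A minimal sketch, assuming a hypothetical CSV export with `team`, `product`, and `cost` columns (real billing reports have many more columns and different names), and made-up budgets:

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample data; column names are assumptions for illustration.
SAMPLE = """team,product,cost
search,api,120.50
search,indexer,80.25
growth,email,40.00
,unknown,15.75
"""

def spend_by(rows, key):
    """Sum the cost column per value of `key`; blank tags go to '(untagged)'."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key] or "(untagged)"] += float(row["cost"])
    return dict(totals)

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
by_team = spend_by(rows, "team")

# Hypothetical budgets; a team with no budget entry is treated as budget 0,
# so untagged spend always shows up as over budget.
budgets = {"search": 150.0, "growth": 100.0}
over_budget = sorted(t for t, total in by_team.items() if total > budgets.get(t, 0.0))

print(by_team)
print("over budget:", over_budget)
```

Swapping `"team"` for `"product"` in `spend_by` gives the per-product view.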

The data is there, but Amazon REALLY makes you do the work. Their default
billing reporting is useless.

Will publish on HN when our tool is open sourced.

~~~
fil_a_del_fee_a
Please do :-) I just want to see cost by EC2 instance type.

~~~
arwineap
You might be interested in netflix's opensource solution ice (
[https://github.com/Netflix/ice](https://github.com/Netflix/ice) )

It's been very helpful to notice and track down extra costs as they come in

~~~
JoshRosen
Netflix ICE has a few quirks but overall it's a pretty useful free tool.

Unfortunately, the project has been largely abandoned and the current version
in the Netflix GitHub repository won't work out of the box for many companies
due to lack of support for new AWS regions, instances, reservation types, and
services, as well as showstopping bugs (such as
[https://github.com/Netflix/ice/issues/210](https://github.com/Netflix/ice/issues/210)).
In order to effectively use ICE today most users will need to maintain their
own patched fork.

I'm currently campaigning to create a community-maintained version of ICE,
with committers from multiple organizations, in order to revive the project:
[https://github.com/Netflix/ice/issues/240#issuecomment-29960...](https://github.com/Netflix/ice/issues/240#issuecomment-299604856)

~~~
jpgvm
Here's hoping the ICE team releases their new tool.

------
natekupp
I'm constantly surprised by how much work Amazon expects its customers to do
themselves. The work that Segment has done here should be a service provided
by AWS directly, continuously updating cost data in a Redshift database
without any customer work required.

We just migrated our data infrastructure to GCP. One of the big motivators was
experiences like this with AWS. We've got near-realtime GCP cost dashboards in
BigQuery, and the only meaningful work on our end to make that happen was
writing the SQL queries.

~~~
jdc0589
Agreed. I don't know why Amazon (and Azure) make this so hard. I've done
something pretty similar to what Segment did (except it supports normalizing
stats from Azure and AWS), and 90% of the work is stuff you don't feel like
you should have to be doing.

~~~
imron
> I don't know why Amazon (and Azure) make this so hard.

The company posting this recently managed to save $1,000,000 annually on their
AWS bill.

Having confusing billing makes it harder to spot that you're paying too much.

~~~
j_s
Yes I think the motivation is clear.

It is the same reason Dropbox doesn't include anything in the web interface to
find large files.

--

Edit: tossing in some options to accomplish this; not really related to the
current discussion:

Space usage analyser for Dropbox? |
[https://webapps.stackexchange.com/questions/47440/space-usag...](https://webapps.stackexchange.com/questions/47440/space-usage-analyser-for-dropbox/56724)

Unclouded - Cloud Manager |
[https://play.google.com/store/apps/details?id=com.cgollner.u...](https://play.google.com/store/apps/details?id=com.cgollner.unclouded)

------
23david
I've been working with a client in a similar situation recently, but instead
of building a custom solution we went with customizing cloudhealth
([https://www.cloudhealthtech.com](https://www.cloudhealthtech.com)).

It's a complicated tool for sure, but once it was all set up we finally had
visibility into a complex multi-account AWS spend, and could start generating
automated cost reports for each company business unit and major customer.

I wouldn't recommend going to the effort of building a custom setup... AWS
billing is just too complicated and it changes frequently to add even more
layers of complexity. As one example, the recent change to add RI size
flexibility completely changed the calculations for RI costs and
recommendations.

I've also used cloudability and cloudcheckr in the past, but both systems had
serious drawbacks. In my opinion cloudhealth is a much more
advanced/professional system at this point.

------
hnov
This is exactly what our team has been working on[0] for over a year now.
Beyond running a simple, 2-3 component service on AWS, it's pretty hard to
actually know what's going on with your billing. Tagging and cost allocation
reports help, but you do quickly run into a wall with shared resources like
segment did.
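One common way around the shared-resource wall is to split a cluster's bill in proportion to what each service reserves. A minimal sketch of that idea, assuming you already know the cluster's total cost and each service's reserved CPU units; the function and service names are hypothetical, and this is not necessarily how Segment or any vendor does it:

```python
def allocate_shared_cost(total_cost, cpu_reserved_by_service):
    """Split a shared cluster's cost proportionally to reserved CPU."""
    total_cpu = sum(cpu_reserved_by_service.values())
    if total_cpu == 0:
        raise ValueError("no reservations to allocate against")
    return {
        service: total_cost * cpu / total_cpu
        for service, cpu in cpu_reserved_by_service.items()
    }

# Hypothetical cluster: $1000 for the month, three services sharing it.
shares = allocate_shared_cost(1000.0, {"api": 2048, "worker": 1024, "cron": 1024})
print(shares)
```

Allocating by reservation rather than actual usage charges teams for what they asked for; idle cluster capacity ends up spread across everyone in the same proportions.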

We don't yet introspect ECS clusters to assign a portion of spend to tasks,
but we do breakdowns of services, tags, instances, ELBs etc across 1-N
accounts. For S3 we can actually introspect buckets and produce rollups by
object metadata as well as heat maps (which objects are being accessed a lot)

[0][https://trackit.io](https://trackit.io)

~~~
mafro
Is that a product your company sells? Or do you plan an open-source version?
Seems like there would be a lot of interest in the OS version.. Could be good
for marketing ;)

------
mattbillenstein
I've gone through this process with a couple companies running medium-size
infra. Basically every month we would import the detailed billing report into
a google sheet (SQL queries in your spreadsheet are awesome btw), do some
analysis on the heavy hitting services, and come up with a couple solutions to
address those. Rinse and repeat. A few big hitters I remember off-hand:

1. Obviously shut down and remove anything you're not using -- instances, ebs
volumes, etc.

2. ELB - paying per-request for many small requests; and you pay double on
the bandwidth in some cases. Switched to terminating ssl using nginx on the
frontends and using dns load balancing to save a pretty good chunk of money.

3. vpc endpoints for s3 - can be significant savings if you're doing a lot of
I/O on s3 from private instances in a vpc over a nat gateway.

4. new instance types -- the newer instance types typically have more
compute/ram/disk for less money, migrate to them.

5. Consolidate services onto a smaller number of VMs where it makes sense

Once you learn some of these tricks, you just sorta do them that way from the
start on new products. Best practices in terms of spend on these platforms.

~~~
mafro
Solid tips. Could you elaborate a little on 3? It's not something I've heard
of before.

~~~
mattbillenstein
Docs are pretty decent on this:
[http://docs.aws.amazon.com/AmazonVPC/latest/UserGuide/vpc-en...](http://docs.aws.amazon.com/AmazonVPC/latest/UserGuide/vpc-endpoints.html)

------
smurfy
Great article. Here at Expedia we are building something very similar for our
AWS spend as we are migrating. Since our spend is multi-million per month (and
we have barely started migrating), it's even more difficult. Hopefully, we
will have a blog post in the coming months as well detailing it.

------
jakozaur
I use the CloudHealth SaaS. We tag all resources once and never change tags. Then
we late-bind all those tags into useful groups (called perspectives in
CloudHealth).

We tried initially building it on our own, but the engineering cost was way
too high. Especially given all quirks and changes in AWS.

------
napsterbr
Honest question: when you get anywhere near six figure monthly bills, isn't it
the time to migrate to your own hardware?

Sure, there are _several_ pros and cons to weigh, but if the application isn't
locked in, a migration to metal could make sense. Has anyone gone through this?

~~~
bdcravens
I would imagine that unless you have staff sitting on their hands, that level
of hardware commitment would require at least one extra employee, generally at
a cost of 6 figures a year.

~~~
bsenftner
Nope, these days servers pretty much run themselves. One just needs to make
sure hardware is not failing, and that is pretty much it. Really. Speaking
from experience.

~~~
bdcravens
Sure, though we're talking about a 6 figure/mo cloud spend. That's dozens or
hundreds of servers. (+network config etc) I run our tech stack, and I can
stand up instances pretty easily, but if I had to rack and stack, manage power,
and bring in leased lines, I'd be well outside of my skill set.

------
skwashd
Our spend is nothing like Segment's, but we've been using stax.io for tracking
our AWS spend. Stax doesn't currently support all AWS services, but they've
got the key ones. There are some nice auditing tools in stax for ensuring
service config meets corporate standards.

It picks up on your AWS tags and then lets you allocate that to
projects/departments/cost centres. Reporting on untagged services is available
OOTB.

The business model is a 30 day trial and then you're charged a percentage of
your AWS monthly spend. It worked out cheaper than building something
ourselves.

------
cyberferret
Analysing AWS billing information is a bit of a 'dark art' in my experience. I
wish they could improve some of their reporting tools to make it easier to
reconcile assets.

For example, we switched to purchasing reserved instances up front to try and
reduce our monthly bills (albeit at a higher up front cost). But trying to
match up Reserved instances with EC2 or Elastic Beanstalk instances that are
actually running is a real headache.

Several times, we've been caught with useless reserved instances sitting there
not being utilised - because we had shut down a project or similar. Had we
been able to see "Oh, we have 10 t1.micro reserved instances, but only 7 are
being utilised by active EC2 instances at the moment" then we could easily
know that we could provision another 3 t1.micro instances at no extra cost to
experiment with something or for a client project. Or we could convert those
t1.micro instances to another region or instance type to meet demand
elsewhere.
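The comparison I wanted is easy to sketch once you have both lists. A minimal example, assuming you have already pulled reservation counts and the instance types of running instances from somewhere; it matches on instance type only and ignores availability zone, platform, and size flexibility, so treat it as an approximation:

```python
from collections import Counter

def ri_slack(reserved, running):
    """Return (unused reservations, on-demand overflow) per instance type."""
    reserved_counts = Counter(reserved)
    running_counts = Counter(running)
    # Counter subtraction drops zero and negative results.
    unused = reserved_counts - running_counts
    overflow = running_counts - reserved_counts
    return dict(unused), dict(overflow)

# Hypothetical inventory matching the example above.
reserved = {"t1.micro": 10, "m3.large": 2}
running = ["t1.micro"] * 7 + ["m3.large"] * 3

unused, overflow = ri_slack(reserved, running)
print("reserved but idle:", unused)
print("running on demand:", overflow)
```

Anything in `unused` is capacity you have already paid for; anything in `overflow` is being billed at on-demand rates.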

It would also be handy on their billing reports if they delineated or showed
the breakdown of the total usage hours against 'on demand' and 'reserved'
instances. I know they _sort of_ do that now, but I think it could be better
laid out so we can see at a glance whether we are getting best use from our
reserved instances.

~~~
sickmate
The Cost Explorer now has Reserved Instance utilization and coverage reports
which are quite useful.

~~~
cyberferret
Thank you!! That RI Utilisation report is perfect! I have never delved into
the Cost Explorer reports before, but this is _exactly_ what we are after, and
I appreciate you taking time to point this out.

------
Rapzid
I started a project on GitHub for loading detailed billing reports into
Elasticsearch so you can analyze the information with Kibana:
[https://github.com/ProTip/aws-elk-billing](https://github.com/ProTip/aws-elk-billing).
I don't really maintain it anymore, but a group of people have
contributed quite a bit more to it over the past couple years. Some here may
find it useful.

------
Sujan
Would using something like [http://cloudcheckr.com/](http://cloudcheckr.com/)
or [https://www.cloudability.com/](https://www.cloudability.com/) have worked,
too?

~~~
Dunedan
Aside from several other companies offering AWS cost analysis as a service,
there is also Netflix ice
([https://github.com/Netflix/ice](https://github.com/Netflix/ice)) if one
wants an open-source, self-hosted solution.

AWS also published a quick guide on how to get the billing data into Redshift and
use QuickSight to analyze it:
[https://aws.amazon.com/de/blogs/aws/new-upload-aws-cost-usag...](https://aws.amazon.com/de/blogs/aws/new-upload-aws-cost-usage-reports-to-redshift-and-quicksight/)

------
yawgmoth
For comparison, the invoice provided by Azure is inherently itemized. "X hours
of Compute at Y rate", "X units of storage on PREMIUM/STANDARD storage type",
etc.

Edit: Sample invoice:
[https://www.microsoft.com/en-us/download/details.aspx?id=388...](https://www.microsoft.com/en-us/download/details.aspx?id=38805)

~~~
jdc0589
The Azure billing report is a disaster; they seem to be slowly fixing it
though. The line items are tricky and usually not super human friendly,
multiple billable resources roll up to single items with weird quantities in
some instances, if you have mixed ARM and ASM resources only the ARM stuff in
your usage report will have data for tags (cause there's no concept of
inheritance from parent resource group), ARM and ASM resources have different
formats for some common fields, and tons more I've repressed from memory.

If you have an EA agreement, half (or more?) of the billing report interfaces
don't work. If you have to use the EA portal due to an EA agreement... let's
just say within the past year they have been through some rough periods of
random undocumented format changes, magically disappearing billing reports for
a few hours at a time, random intermixed windows and linux style line endings
in the same file, periods where they are DAYS behind your actual usage, etc...
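The mixed line endings are at least cheap to defend against. A minimal sketch of normalizing a downloaded report before parsing it; the column names are hypothetical and this assumes a UTF-8 CSV:

```python
import csv
import io

def parse_report(raw: bytes):
    """Decode a CSV report and normalize CRLF/CR line endings to LF."""
    text = raw.decode("utf-8-sig")  # also strips a leading BOM if present
    normalized = text.replace("\r\n", "\n").replace("\r", "\n")
    # newline="" leaves line handling to the csv module.
    return list(csv.DictReader(io.StringIO(normalized, newline="")))

# Hypothetical report with Windows and Unix endings mixed in one file.
raw = b"service,cost\r\ncompute,10.5\nstorage,2.0\r\n"
rows = parse_report(raw)
print(rows)
```

Python's `csv` reader tolerates mixed endings on its own; the explicit normalization matters more if the same text is also fed to line-oriented tools.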

Source: I wrote an infrastructure inventory and billing/usage aggregation
system that normalizes data from Azure, AWS, and vCenter to help our company
easily track spending by service, tags, etc... across all of the platforms where
we have a presence.

------
x_foo_x
Just use more than one AWS account!!!! Use Roles to manage any cross-account
dependencies.

~~~
wahnfrieden
Does not help the author's particular use case with ECS.

------
rohan404
Shameless plug, but it was the lack of clarity that we had with our AWS bills
that prompted us to build [https://cloudops.ai](https://cloudops.ai)

We've got a prepaid service that allows you to always ensure your AWS bill
never goes above a certain amount as well as flat discounts on your monthly
spend

------
cagataygurturk
The solution is one account per team
[https://stups.io/why](https://stups.io/why)

Installing ICE to see costs in detail
[https://github.com/Netflix/ice](https://github.com/Netflix/ice)

------
grundprinzip
To make sure that resources are compliant, I would suggest using AWS Config
Rules, as they let you specify, for example, that tags must be applied.
Validation of resources is done in a Lambda function, which should in theory
let you terminate resources that are not compliant.
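The core of such a check is small. A minimal sketch of just the tag validation logic, with a hypothetical required-tag policy; the wiring into a Lambda handler and the AWS Config evaluation API is deliberately left out and nothing here calls AWS:

```python
# Hypothetical policy: every resource must carry these tag keys.
REQUIRED_TAGS = frozenset({"team", "product"})

def missing_tags(resource_tags, required=REQUIRED_TAGS):
    """Return the required tag keys that are absent or blank, sorted."""
    present = {key for key, value in resource_tags.items() if value and value.strip()}
    return sorted(required - present)

def is_compliant(resource_tags):
    """A resource is compliant when no required tag is missing."""
    return not missing_tags(resource_tags)

print(missing_tags({"team": "search"}))
print(is_compliant({"team": "search", "product": "api", "env": "prod"}))
```

Treating blank values as missing keeps people from satisfying the rule with empty tags.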

------
zitterbewegung
If you are getting to the point of a large AWS spend wouldn't that presuppose
that you might want to start moving on to on-premises physical servers?

------
avenoir
As someone who's a noob to AWS and is getting ready to launch a product on it,
this is absolutely terrifying. As in what to expect in terms of billing.

~~~
kevinburke
Extremely unnecessary until your bill gets large, and/or you have many
different teams using AWS

------
jorblumesea
Anyone else see the irony of using aws services to understand your aws bill?
It makes complete sense, just funny.

------
LeicaLatte
Very nice and detailed.

------
monochromatic
When did "spend" become a noun?

~~~
twic
In the 1930s:

[https://books.google.com/ngrams/graph?content=average%20spen...](https://books.google.com/ngrams/graph?content=average%20spend%2Caverage%20spending&year_start=1900&year_end=2008&corpus=15&smoothing=3&share=&direct_url=t1%3B%2Caverage%20spend%3B%2Cc0%3B.t1%3B%2Caverage%20spending%3B%2Cc0)

Via:

[https://english.stackexchange.com/questions/336478/is-it-rea...](https://english.stackexchange.com/questions/336478/is-it-really-ok-to-use-spend-as-a-noun)

EDIT: lies, first use is 1688:

[http://www.grammarphobia.com/blog/2012/04/spend.html](http://www.grammarphobia.com/blog/2012/04/spend.html)

