Breaking a large AWS spend into understandable pieces (segment.com)
240 points by fullung on May 18, 2017 | 74 comments

Sounds about right, this is really a hidden 'extra cost' with AWS. Once you get to a certain size, you really need to spend dedicated engineering resources to build extra tooling like this or do other 'cost engineering' like RI capacity planning - which you totally thought you didn't need to do anymore right?

You can offload some of that to paid services like CloudHealth but it still takes engineering resources to manage the costs you find out about. E.g. you may want a 'fully immutable' data pipeline (e.g. DPL) - with each job running on a fresh instance each time - but that usually means getting hit with a full billing hour even if the job takes <1 minute. So you have to use schedulers/containers, but then you're working outside the 'instance' paradigm and get hit with the problems Segment talks about.
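To put a rough number on the full-billing-hour point, here is a minimal sketch. The hourly rate and the 45-second runtime are made-up values for illustration, not real AWS prices; only the round-up-to-the-hour behaviour comes from the comment above.

```python
import math


def billed_cost(runtime_seconds, hourly_rate, granularity_seconds=3600):
    """Cost of one job on a fresh instance when usage is rounded up
    to the billing granularity (one hour by default)."""
    units = math.ceil(runtime_seconds / granularity_seconds)
    return units * granularity_seconds / 3600 * hourly_rate


HOURLY_RATE = 0.10   # hypothetical on-demand price in USD per hour
JOB_SECONDS = 45     # hypothetical job runtime

hourly_billed = billed_cost(JOB_SECONDS, HOURLY_RATE)            # rounded up to a full hour
per_second_billed = billed_cost(JOB_SECONDS, HOURLY_RATE, 1)     # what exact metering would cost

# A 45-second job billed by the hour pays for 3600 / 45 = 80 times its runtime.
overhead_factor = hourly_billed / per_second_billed
print(f"hourly: ${hourly_billed:.4f}  exact: ${per_second_billed:.4f}  overhead: {overhead_factor:.0f}x")
```

Packing many short jobs onto a shared instance with a scheduler amortises that rounded-up hour, which is the trade-off the comment describes.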

To control costs and provide good UX for end users we eventually started regressing to mainframe-style computing with the ridiculous X1 RIs which was an interesting experiment. These take like a house down-payment but on the plus side users are really pleased when they see 128 cpus in htop.

I'm curious what you mean by fully immutable data pipeline or DPL. Is this a data processing methodology or an actual technology?

Ah, probably should have said 'idempotent'. Often, you want a fresh and clean resource to pop into existence, process something, persist the result someplace, and then go away. To align with the rest of AWS in terms of billing, security/IAM, networking, etc. you kinda need to use the 'ec2 instance' as this resource, but this causes problems, so usually the customer has to create and manage another level of abstraction (like EMR, kube, mesos, ECS, what have you).

It'd be real nice if the customer didn't have to do this but AWS would rather spend their engineering resources to compete with the G at beagle identification APIs or try to vertically integrate with poor clones of github. (I realize this is slightly unfair.)

I know, I know - you're saying 'lambda' is the answer - won't get into that here.

I don't think anyone would say Lambda is the answer for anything but very quick jobs, not until they pump the available run time up to an hour.

Can you go into detail of what you mean by "mainframe style computing "?

At a guess: Get the biggest box you can, run a ton of things on it (eg containers, or something like that).

This is opposed to the 'cloud' thing of get an instance just big enough to run your job, run lots of them if you need to scale.

We are about to open source our tool that consumes the billing files from S3 (via SQS notification) and then loads them into redshift auto handling the purging/updating.
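The first step of a pipeline like this, finding out which billing file just landed, can be sketched as below. This is not the tool mentioned above (which was not yet published); it assumes S3 is configured to send event notifications directly to the SQS queue, and the bucket and key names are hypothetical. The Redshift load itself is left out.

```python
import json
from urllib.parse import unquote_plus


def billing_objects(sqs_body):
    """Return (bucket, key) pairs for newly created objects described in
    an S3 event notification that arrived as an SQS message body."""
    event = json.loads(sqs_body)
    pairs = []
    for record in event.get("Records", []):
        # Only react to object creation, not deletes or other events.
        if not record.get("eventName", "").startswith("ObjectCreated"):
            continue
        s3 = record["s3"]
        # Object keys are URL-encoded in the notification payload.
        pairs.append((s3["bucket"]["name"], unquote_plus(s3["object"]["key"])))
    return pairs


# Hypothetical message body, trimmed to the fields the function reads.
sample_body = json.dumps({
    "Records": [
        {
            "eventName": "ObjectCreated:Put",
            "s3": {
                "bucket": {"name": "example-billing-bucket"},
                "object": {"key": "reports/billing+2017-05.csv.zip"},
            },
        }
    ]
})

print(billing_objects(sample_body))
```

Each returned pair would then be handed to whatever loads the file into the warehouse, with old rows for the same billing month purged first since the report is rewritten as the month progresses.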

Dealing with Amazon billing is absolutely infuriating. We are also on their EDP (Enterprise Discount) program and all of it is done MANUALLY. There is literally NO way I can check hundreds if not thousands of instance hours to ensure there was no error in their processing. It's complete madness. We use a dozen services, and EBS, bandwidth, and inter-AZ bandwidth are impossible to audit.

That being said, we have Looker sitting on top of Redshift, we make sure each "team" has a tag (infra won't spin up without it), and then we can easily set budgets, watch trends in Looker, and track spend by team and by product name. Our finance team loves it.
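As a rough illustration of what "track spend by team" boils down to once every resource carries a team tag, here is a minimal sketch over a tiny inline CSV. The column names (`user:team`, `UnBlendedCost`) and the sample rows are assumptions for the example; a real detailed billing report has many more columns and the setup above does this in Redshift and Looker rather than in a script.

```python
import csv
import io
from collections import defaultdict
from decimal import Decimal

# Hypothetical excerpt of a billing report with a team tag column.
SAMPLE_REPORT = """ProductName,UnBlendedCost,user:team
Amazon Elastic Compute Cloud,12.50,search
Amazon Simple Storage Service,3.25,search
Amazon Elastic Compute Cloud,7.00,payments
Amazon Elastic Compute Cloud,1.75,
"""


def spend_by_tag(report_csv, tag_column="user:team", cost_column="UnBlendedCost"):
    """Sum the cost column per tag value; rows without a tag are grouped
    under '(untagged)' so they stay visible instead of disappearing."""
    totals = defaultdict(Decimal)
    for row in csv.DictReader(io.StringIO(report_csv)):
        team = row[tag_column] or "(untagged)"
        totals[team] += Decimal(row[cost_column])
    return dict(totals)


for team, cost in sorted(spend_by_tag(SAMPLE_REPORT).items()):
    print(f"{team:12s} {cost}")
```

The "(untagged)" bucket is the part worth watching: refusing to spin up infra without a tag, as described above, is what keeps it near zero.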

The data is there, but Amazon REALLY makes you do the work. Their default billing reporting is useless.

Will publish on HN when our tool is open sourced.

Please do :-) I just want to see cost by EC2 instance type.

You might be interested in netflix's opensource solution ice ( https://github.com/Netflix/ice )

It's been very helpful to notice and track down extra costs as they come in

Netflix ICE has a few quirks but overall it's a pretty useful free tool.

Unfortunately, the project has been largely abandoned and the current version in the Netflix GitHub repository won't work out of the box for many companies due to lack of support for new AWS regions, instances, reservation types, and services, as well as showstopping bugs (such as https://github.com/Netflix/ice/issues/210). In order to effectively use ICE today most users will need to maintain their own patched fork.

I'm currently campaigning to create a community-maintained version of ICE, with committers from multiple organizations, in order to revive the project: https://github.com/Netflix/ice/issues/240#issuecomment-29960...

Here's to hoping the ICE team releases their new tool.

That is available in the cost explorer...

Cost explorer is EXTREMELY limited.

If you want to get good information by Tag, By Product, By detailed Date Range, etc you need it in a DB.

The cost explorer gives you information by tag and/or product. You really don't need a DB.

I'm constantly surprised by how much work Amazon expects its customers to do themselves. The work that Segment has done here should be a service provided by AWS directly, continuously updating cost data in a Redshift database without any customer work required.

We just migrated our data infrastructure to GCP. One of the big motivators was experiences like this with AWS. We've got near-realtime GCP cost dashboards in BigQuery, and the only meaningful work on our end to make that happen was writing the SQL queries.

Agreed. I don't know why Amazon (and Azure) make this so hard. I've done something pretty similar to what Segment did (except it supports normalizing stats from Azure and AWS), and 90% of the work is stuff you don't feel like you should have to be doing.

> I don't know why Amazon (and Azure) make this so hard.

The company posting this recently managed to save $1,000,000 annually on their AWS bill.

Having confusing billing makes it harder to spot that you're paying too much.

Yes I think the motivation is clear.

It is the same reason Dropbox doesn't include anything in the web interface to find large files.


Edit: tossing in some options to accomplish this; not really related to the current discussion:

Space usage analyser for Dropbox? | https://webapps.stackexchange.com/questions/47440/space-usag...

Unclouded - Cloud Manager | https://play.google.com/store/apps/details?id=com.cgollner.u...

Product development incentive structures.

Imagine you lead the product team at AWS.

- The team is reviewing what to build for the next quarter.

- You have a long list of revenue-generating features

- At the end of the list there's one more item that'll help your customers spend LESS money on your product.

- You can only build so many things and you know that you, your department, and Amazon the company will get pats on the back in $$$ form if you focus on the revenue generating features

- Sure, it'd be really good to help longterm customers understand their costs better, but your biggest ones have the resources to build that infrastructure themselves anyways.

And that is why this is so hard.

I think you're absolutely right, but what's hilarious to me is that they have private pricing for many of their services. For the traffic we're already doing, we just asked and they knocked our cloudfront bill down nearly 75%. We didn't have to change our usage at all. Granted, we serve a lot of traffic.

What order of magnitude of traffic are we talking?

A couple of PB per week, I'd guess

I'm pretty sure it's in Amazon's interest for its costs to be opaque.

Second this. We just moved our sizable infrastructure to GCP for more or less the exact same reason. GCP makes all of this a breeze -- I don't understand why other providers can't do the same.

How much more would current customers pay for better billing dashboards?

How many customers quit amazon due to hard to read billing?

I suspect those are the two big questions to explain why this feature is lower on the priority radar...

This way of looking at things has never been particularly convincing to me... Of all the work that's done, at any organisation, almost none of it can truly answer those questions affirmatively.

"That lightbulb is broken!"

"I was going to replace it, but I couldn't find a customer to pay for it, and nobody has threatened to change to a brighter competitor"

Have you ever founded a startup? Did you spend your time working on features users will pay for, or an awesome billing dashboard (assuming billing is not your core product of course?) If so, how successful were you?

It's about focus. Only so many hours in a day, so many dollars, so many developers. If AWS got a nicer billing dashboard but Lambda was delayed by 6 months, how would it have changed the place AWS has in the market?

Yes you could take this too far and never change a light bulb. But I would answer "I will be more productive if I have a nice light bulb, so it is worth the 5 minutes to change it"

I'm sure amazon is working on it.

They're letting their partners work on it.

I've been working with a client in a similar situation recently, but instead of building a custom solution we went with customizing cloudhealth (https://www.cloudhealthtech.com).

It's a complicated tool for sure, but once it was all set up we finally had visibility into a complex multi-account AWS spend, and could start generating automated cost reports for each company business unit and major customer.

I wouldn't recommend going to the effort of building a custom setup... AWS billing is just too complicated and it changes frequently to add even more layers of complexity. As one example, the recent change to add RI size flexibility completely changed the calculations for RI costs and recommendations.

I've also used cloudability and cloudcheckr in the past, but both systems had serious drawbacks. In my opinion cloudhealth is a much more advanced/professional system at this point.

This is exactly what our team has been working on[0] for over a year now. Beyond running a simple, 2-3 component service on AWS, it's pretty hard to actually know what's going on with your billing. Tagging and cost allocation reports help, but you do quickly run into a wall with shared resources like segment did.

We don't yet introspect ECS clusters to assign a portion of spend to tasks, but we do breakdowns of services, tags, instances, ELBs etc across 1-N accounts. For S3 we can actually introspect buckets and produce rollups by object metadata as well as heat maps (which objects are being accessed a lot)


Is that a product your company sells? Or do you plan an open-source version? Seems like there would be a lot of interest in the OS version.. Could be good for marketing ;)

I've gone through this process with a couple of companies running medium-size infra. Basically every month we would import the detailed billing report into a google sheet (SQL queries in your spreadsheet are awesome btw), do some analysis on the heavy hitting services, and come up with a couple of solutions to address those. Rinse and repeat. A few big hitters I remember off-hand:

1. Obviously shut down and remove anything you're not using -- instances, ebs volumes, etc.

2. ELB - paying per-request for many small requests; and you pay double on the bandwidth in some cases. Switched to terminating ssl using nginx on the frontends and using dns load balancing to save a pretty good chunk of money.

3. vpc endpoints for s3 - can be significant savings if you're doing a lot of I/O on s3 from private instances in a vpc over a nat gateway.

4. new instance types -- the newer instance types typically have more compute/ram/disk for less money, migrate to them.

5. Consolidate services onto a smaller number of VMs where it makes sense

Once you learn some of these tricks, you just sorta do them that way from the start on new products. Best practices in terms of spend on these platforms.
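For tip 3, the saving comes from the per-GB data processing charge on traffic that passes through a NAT gateway, which S3 traffic routed through a gateway VPC endpoint avoids. A minimal sketch of the arithmetic follows; the per-GB rate and the monthly volume are assumptions for illustration and should be checked against current pricing for your region.

```python
def nat_processing_cost(gb_per_month, nat_rate_per_gb=0.045):
    """Monthly NAT gateway data processing charge for traffic that could
    instead go through an S3 gateway endpoint.

    nat_rate_per_gb is an assumed price, not taken from the thread.
    """
    return gb_per_month * nat_rate_per_gb


# Hypothetical workload: 50 TB of S3 traffic per month from private subnets.
monthly_gb = 50 * 1024
saving = nat_processing_cost(monthly_gb)
print(f"Approximate monthly NAT processing charge avoided: ${saving:,.2f}")
```

The NAT gateway's hourly charge stays either way if other traffic still needs it, so only the per-GB part counts as the saving.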

Solid tips. Could you elaborate a little on 3? It's not something I've heard of before.

Great article. Here at Expedia we are building something very similar for our AWS spend as we are migrating. Since our spend is multi-million per month (and we have barely started migrating), it's even more difficult. Hopefully, we will have a blog post in the coming months as well detailing it.

I use SaaS CloudHealth. We tag all resources once and never change tags. Then we late-bind all those tags into useful groups (called perspectives in CloudHealth).

We tried initially building it on our own, but the engineering cost was way too high. Especially given all quirks and changes in AWS.

Honest question: when you get anywhere near six figure monthly bills, isn't it the time to migrate to your own hardware?

Sure, there are several pros and cons to weigh, but if the application isn't locked in, a migration to metal could make sense. Has anyone gone through this?

Working at a large company migrating from our own datacenters to AWS where our monthly bill is approaching 8 figures.

The higher-ups were lured in with promises of easy elasticity that'd allow them to save money compared to our relatively overprovisioned DCs, but the reality is that only some projects benefit from this, and certain teams that need large numbers of heavy-duty servers for baseline usage end up much more expensive.

We have an internal team of 5-6 who are responsible just for building infrastructure on AWS for other teams and prodding teams to get into AWS.

At the very least you'd need people with experience running bare metal, delay product pipeline to work on the transition, train the team, build replacements for all of the AWS tools, rebuild this monitoring pipeline without Cloudwatch.

Probably looking at $500k in engineering and staffing and training costs at the very least, just to get back to where you started, not to mention the progress you didn't make on the product.

Depends on your workload. If you have a constant workload, then sure. But if you have variable workloads then AWS/cloud providers are a no brainer.

Source: 6 figure/mo AWS spend. We use pretty much every infrastructure piece in AWS in multiple regions and to replicate that flexibility in our own data centers would be an incredible cost.

Your workload doesn't need to be super constant either -- I think I ran the numbers some time ago, but for larger boxes, if you needed the machines for more than 8 hours a day, it was cheaper to run them in a DC and just leave them idle the rest of the time than to run them for that 8+ hours on AWS and shut them down for the rest of the time.

I would imagine that unless you have staff sitting on their hands, that level of hardware commitment would require at least one extra employee, generally at a cost of 6 figures a year.

Nope, these days servers pretty much run themselves. One just needs to make sure hardware is not failing, and that is pretty much it. Really. Speaking from experience.

Sure, though we're talking about a 6 figure/mo cloud spend. That's dozens or hundreds of servers (+ network config etc). I run our tech stack, and I can stand up instances pretty easily, but if I had to rack and stack, manage power, and bring in leased lines, I'd be well outside of my skill set.

Not 6 figures a month though.

Never said it was an equal cost. Merely pointing out that overhead expenses will cut into cost savings. Your cloud spend isn't just cost of hardware + cloud provider's profit margin.

A few years ago, I built a compute-intensive REST API: applied machine learning that did 3D reconstruction of human heads from images. After careful spec'ing of the system's requirements, it was clear that any of the cloud services were exponentially more expensive than running my own hardware. I needed a $96K per month spend at AWS, the lowest of the cloud services at the time, or I could spend ~$50K and build a globally scaled server cluster of my own. People speak as if there is some issue with locating quality devOps individuals, but I seem to find them seeking work all the time. So I hired one to help me build the cluster: he re-visioned my specs, adding a hardware firewall, I purchased the hardware, we assembled the cluster at a local co-working space, and then we installed it into a local ISP, in a rented cabinet. Total expense to build was about $65K (less than one month of AWS), and the monthly cabinet rental was $600 a month. And the month to month maintenance was a whopping 5-10 minutes a few times a week. This is NOT HARD FOLKS and it SAVES A SHIT TON OF MONEY. Don't believe the cloud hype - it is all hype.
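The figures quoted in this comment can be laid side by side with a small sketch. The dollar amounts are the ones stated above ($96K per month on AWS, about $65K to build, $600 per month for the cabinet); the calculation deliberately ignores staff time, hardware refresh, and bandwidth, which the comment does not price and the replies dispute.

```python
def cumulative_cost(months, upfront, monthly):
    """Total spend after a number of months."""
    return upfront + monthly * months


def breakeven_month(upfront, own_monthly, cloud_monthly, max_months=120):
    """First month in which owning is cheaper in total than renting,
    or None if that never happens within max_months."""
    for month in range(1, max_months + 1):
        if cumulative_cost(month, upfront, own_monthly) < cumulative_cost(month, 0, cloud_monthly):
            return month
    return None


AWS_MONTHLY = 96_000      # quoted AWS estimate per month
BUILD_COST = 65_000       # quoted total build expense
CABINET_MONTHLY = 600     # quoted cabinet rental per month

print("break-even month:", breakeven_month(BUILD_COST, CABINET_MONTHLY, AWS_MONTHLY))
print("12 months own:", cumulative_cost(12, BUILD_COST, CABINET_MONTHLY))   # 72200
print("12 months AWS:", cumulative_cost(12, 0, AWS_MONTHLY))                # 1152000
```

Adding a line for engineering salary to `own_monthly` is the quickest way to test how sensitive the conclusion is to the overhead argument made elsewhere in the thread.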

Most apps are much more ops intensive per unit of compute than what you describe, which makes paying cloud compute costs to offload some large fraction of the ops workload to someone else more worthwhile than it seems to have been for your app.

And even in your case, I wonder if it would have gotten off the ground if you had to deal with infrastructure up front rather than as an efficiency move after the basic application was built, up, running, and stable: even if the cloud was mostly useful for ramping new systems up to the point where the system was basically stabilized and the infrastructure and ops needs were clear and could be assessed at leisure and efficiently provided with dedicated resources, that would still be a huge role.

One of the segment engineers addressed this here: https://news.ycombinator.com/item?id=13887218

tl;dr: They like the IaaS (ECS, Dynamo, ..), they need a lot of resources, and they're moving too fast to build it out in-house

Our spend is nothing like Segment's, but we've been using stax.io for tracking our AWS spend. Stax doesn't currently support all AWS services, but they've got the key ones. There are some nice auditing tools in stax for ensuring service config meets corporate standards.

It picks up on your AWS tags and then lets you allocate that to projects/departments/cost centres. Reporting on untagged service is available OOTB.

The business model is a 30 day trial and then you're charged a percentage of your AWS monthly spend. It worked out cheaper than building something ourselves.

Analysing AWS billing information is a bit of a 'dark art' in my experience. I wish they could improve some of their reporting tools to make it easier to reconcile assets.

For example, we switched to purchasing reserved instances up front to try and reduce our monthly bills (albeit at a higher up front cost). But trying to match up Reserved instances with EC2 or Elastic Beanstalk instances that are actually running is a real headache.

Several times, we've been caught with useless reserved instances sitting there not being utilised - because we had shut down a project or similar. Had we been able to see "Oh, we have 10 t1.micro reserved instances, but only 7 are being utilised by active EC2 instances at the moment" then we could easily know that we could provision another 3 t1.micro instances at no extra cost to experiment with something or for a client project. Or we could convert those t1.micro instances to another region or instance type to meet demand elsewhere.

It would also be handy on their billing reports if they delineated or showed the breakdown of the total usage hours against 'on demand' and 'reserved' instances. I know they _sort of_ do that now, but I think it could be better laid out so we can see at a glance whether we are getting best use from our reserved instances.
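The "10 reserved, 7 in use" check described above is easy to express once you have the two lists. A minimal sketch, assuming you already have reservation counts per instance type and a list of running instance types (for example from the EC2 API); real reservation matching also depends on region or availability zone, platform, and tenancy, which this ignores.

```python
from collections import Counter


def unused_reservations(reserved, running):
    """Return {instance_type: spare_count} for reservations that are not
    fully covered by running instances.

    reserved: mapping of instance type to number of reserved instances
    running:  iterable of instance types, one entry per running instance
    """
    in_use = Counter(running)
    return {
        instance_type: count - in_use[instance_type]
        for instance_type, count in reserved.items()
        if count > in_use[instance_type]
    }


# The scenario from the comment, plus a hypothetical fully used type.
reserved = {"t1.micro": 10, "m3.large": 2}
running = ["t1.micro"] * 7 + ["m3.large"] * 3

print(unused_reservations(reserved, running))  # {'t1.micro': 3}
```

The same two inputs, compared in the other direction, give on-demand instances with no reservation covering them.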

The Cost Explorer now has Reserved Instance utilization and coverage reports which are quite useful.

Thank you!! That RI Utilisation report is perfect! I have never delved into the Cost Explorer reports before, but this is exactly what we are after, and I appreciate you taking time to point this out.

I started project on github for loading detailed billing reports into elasticsearch so you can analyze the information with kibana: https://github.com/ProTip/aws-elk-billing . I don't really maintain it anymore, but a group of people have contributed quite a bit more to it over the past couple years. Some here may find it useful.

Would using something like http://cloudcheckr.com/ or https://www.cloudability.com/ have worked, too?

Aside from several other companies offering AWS cost analysis as a service, there is also Netflix ice (https://github.com/Netflix/ice) if one wants an open-source, self-hosted solution.

AWS also published a quick guide how to get the billing data into Redshift and use Quicksight to analyze it: https://aws.amazon.com/de/blogs/aws/new-upload-aws-cost-usag...

If you have the money to spend, Cloudcheckr, cloud health & cloudability are great. Some of their APIs are lacking which makes going with the DBR ingestion solution a good idea for large organizations.

For comparison, the invoice provided by Azure is inherently itemized. "X hours of Compute at Y rate", "X units of storage on PREMIUM/STANDARD storage type", etc.

Edit: Sample invoice: https://www.microsoft.com/en-us/download/details.aspx?id=388...

The Azure billing report is a disaster; they seem to be slowly fixing it though. The line items are tricky and usually not super human friendly, multiple billable resources roll up to single items with weird quantities in some instances, if you have mixed ARM and ASM resources only the ARM stuff in your usage report will have data for tags (cause there's no concept of inheritance from parent resource group), ARM and ASM resources have different formats for some common fields, and tons more I've repressed from memory.

If you have an EA agreement, half (or more?) of the billing report interfaces don't work. If you have to use the EA portal due to an EA agreement... let's just say within the past year they have been through some rough periods of random undocumented format changes, magically disappearing billing reports for a few hours at a time, random intermixed Windows and Linux style line endings in the same file, periods where they are DAYS behind your actual usage, etc...

Source: I wrote an infrastructure inventory and billing/usage aggregation system that normalizes data from Azure, AWS, and vCenter to help our company easily track spending by service, tags, etc... across all of the platforms where we have a presence.

That's what you get with the "Detailed Billing Reports" (https://docs.aws.amazon.com/awsaccountbilling/latest/aboutv2...) for AWS as well as mentioned in the article.

That still doesn't solve the problem of making sense of each item.

So what's the advantage of the Azure invoices?

Just use more than one AWS account!!!! Use Roles to manage any cross-account dependencies.

Does not help the author's particular use case with ECS.

Shameless plug, but it was the lack of clarity that we had with our AWS bills that prompted us to build https://cloudops.ai

We've got a prepaid service that allows you to always ensure your AWS bill never goes above certain amount as well as flat discounts on your monthly spend

The solution is one account per team https://stups.io/why

Installing ICE to see costs in detail https://github.com/Netflix/ice

To make sure that resources are compliant, I would suggest using AWS Config Rules, as they let you specify, for example, that tags need to be applied. Validation of resources is done in a Lambda function, which should in theory also let you terminate resources that are not compliant.
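The validation logic such a rule would run can be sketched as a plain function. This is only the decision step, under the assumption that the rule's Lambda receives a configuration item with a `tags` mapping; the required tag name `team` is hypothetical, and reporting the result back to AWS Config (or terminating anything) is left out.

```python
def evaluate_required_tags(configuration_item, required_tags=("team",)):
    """Decide whether a resource carries every required tag.

    configuration_item: dict describing the resource; only its optional
                        'tags' mapping is inspected here.
    Returns a dict with the compliance verdict and the missing tag names.
    """
    tags = configuration_item.get("tags") or {}
    # A tag that is present but empty is treated as missing.
    missing = sorted(name for name in required_tags if not tags.get(name))
    return {
        "compliance_type": "NON_COMPLIANT" if missing else "COMPLIANT",
        "missing_tags": missing,
    }


# Hypothetical configuration items.
tagged = {"resourceType": "AWS::EC2::Instance", "tags": {"team": "search"}}
untagged = {"resourceType": "AWS::EC2::Instance", "tags": {}}

print(evaluate_required_tags(tagged))
print(evaluate_required_tags(untagged))
```

Keeping the check separate from the AWS calls makes it easy to unit test before wiring it into a rule.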

If you are getting to the point of a large AWS spend, wouldn't that presuppose that you might want to start moving to on-premises physical servers?

As someone who's a noob to AWS and is getting ready to launch a product on it, this is absolutely terrifying. As in what to expect in terms of billing.

Extremely unnecessary until your bill gets large, and/or you have many different teams using AWS

Anyone else see the irony of using aws services to understand your aws bill? It makes complete sense, just funny.

Very nice and detailed.

When did "spend" become a noun?
