* An example CloudFormation from a Re:Invent (AWS conference) session silently failed to tear down some resources.
* Not trusting CloudFormation, I manually checked each (known service, region) pair to make sure resources had been torn down. This failed to identify the running resources because a tutorial div opened in regions with no running resources and stayed open when I switched to a region with running resources, hiding them.
* Not trusting my manual service tour, I kept a close eye on my daily costs until I saw several days pass with $0 spend. This failed because free tier credits were hiding substantial service usage.
* Not trusting any of the above, I had billing alerts set as a catch-all. They correctly triggered on an unrelated usage surge, but with such high latency that when they failed to reset I incorrectly blamed the latency rather than a genuine underlying charge.
Bam, $700 charge next month. Amazon was quick to refund half of it. I was eventually able to get them to refund the other half by making waves in the support system of a high-spend business account.
At the last re:invent session I went to, I surveyed a table of 6 people. After sharing my $700 figure, 3 of the 6 came forward with even bigger numbers, 1 of the 6 with a smaller number, and the remaining person was a newbie.
Maybe there are enough people out there who don't notice a $700 overage to offset that, but it seems like there would at least be a visible decrease in support costs if they had a better system.
Not everyone has a high spend account and a rep who brags about their policy of refunding mistakes frequently enough to snipe them with "If you have a policy of refunding mistakes, then why didn't you?" in front of customers.
I would like to emphasize that I've chosen my words carefully: I don't know for certain that the rep intervened, but I made trouble, emailed him the information necessary to resolve my issue, and then my half refund became a full refund. Any connection between these dots is pure speculation on my part.
The first 2 points in GP's post were Amazon's bugs. It's not a user mistake if the user fails to fix the vendor's bugs.
These aren't controversial, even though they reduce revenue, because they improve the customer experience.
Many government agencies have to go through 3rd parties ("lowest qualified bidder", which AWS often doesn't bid on) to pay our bill... And the contractors themselves use 3rd parties to figure out how much to bill us... so things like AWS bill alerts are not possible (imagine the whole billing section being permission denied while using the root account user). In addition, 3rd parties do not always provide great tools to set up bill alerts.
We have 40+ AWS accounts and can't track our spend without 10 button clicks per account, which can't be done in parallel because the app's browser caching locks us into working with one account at a time or it corrupts the cache. The process takes a minute per account, if the website doesn't crash trying to process our records (which run to more than 3 million line items for some accounts).
All that said: we basically were screwed at monitoring our bill.
In the fun that is semi-serverless, we had a container running in ECS that logged directly to CloudWatch. It went into an infinite loop late one month. That didn't push our bill far out of the expected ballpark when it was processed by the third party and delivered to us 20 days after the billing cycle closed.
Next month however, by the time we got the bill, it had run for the ENTIRE month in an infinite loop, and already 75% of the next month. It pushed 50,000GB of an infinite loop log data at 50 cents per GB ingested the first month (plus storage costs). (That's about $30,000 for month 1)
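The scale of that figure is easy to sanity-check with back-of-the-envelope arithmetic, using only the numbers stated above (the $0.50/GB ingestion rate; storage charges are extra and not modeled here):

```python
# Back-of-the-envelope check on the CloudWatch Logs ingestion bill.
# The price comes from the comment above ($0.50/GB ingested);
# storage charges are additional and omitted.
INGEST_PRICE_PER_GB = 0.50

def ingestion_cost(gb_ingested: float) -> float:
    return gb_ingested * INGEST_PRICE_PER_GB

month1 = ingestion_cost(50_000)  # 50,000 GB of loop output in the first month
print(month1)  # 25000.0 -- ingestion alone; storage pushes it toward ~$30,000
```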
AWS did not provide assistance because it was paid for with credits, but it basically ate all our remaining credits after the second month's bill came in.
After that, we got the contractor to simply give us a copy of the "detailed billing reports" daily and built a process to make our own bill monitors (which at that stage had markups from the contractor). We eventually got somewhat of a better monitoring system through the third party app as well, but we were not aware that was possible because it was not accessible without the contractor setting it up for us (hidden menu options to combine accounts)
Call this a "distributed denial of responsibility attack". It's very convenient that there's no one point of easy blame, that means that nobody has to change.
I'm not complaining that it's something nefarious on AWS's part; I'm saying it's designed in a way such that someone using its various services can easily lose track of billing.
Yes, I could have closed the account altogether; but I didn't want to. Now I wonder, if AWS starts charging for the billing alerts itself whether I would catch it before I actually receive a billing alert for the billing alerts.
Something similar happened to me at a Kubernetes/GCP tutorial. We started with a $100 credit. The tutorial included setting up massive (for tutorial purposes) instances. Because I had played with my account before and already had a single instance running, I hit account limits and my tutorial code failed to work. A frustrating experience richer, I was very busy at work after returning from the conference.
When I got back to my tutorial code 3 weeks later, I noticed that less than $1 of my $100 was left. The n - 1 instances I had created during the tutorial had been running for 3 weeks. User error of course, but at which software job do no errors happen?
The only positive thing about GCP I remember is that they promised not to overflow from the credit into real money on my credit card. At least that's how I remember it. I didn't (need to) test it, because I noticed the issue $1 early. AWS, as far as we know, makes no such promise.
Just two stories on HN I've seen about it:
I have nothing against GCP or Firebase, just countering the "this seems to be unique to Amazon" comment.
I know they selected a location in New York they shouldn't have / that the community rejected, and after a big outcry they ended up withdrawing in the face of community pressure, not pushing ahead against it.
I think the one in Virginia is going ahead; they just got permits there for a metro station. Are there protests in Virginia they are not listening to?
So it was a colossal waste of time for 5,000+ people across North America. Burnt a lot of goodwill. Many cities learned a valuable lesson on dealing with these big multi-nationals as a result of it. So it won't likely happen like this again.
What did Amazon do different here? Instead of in back rooms, it was all out in the open.
Curious if there are other examples of this.
Other examples are federal grant programs: agencies are required to publicize them, and lots of folks/nonprofits spend a lot of time applying, but in reality most awards are renewed with existing partners.
There were additional, negotiated real estate related credits that are also likely to be similarly negotiated by any other developer.
They were negotiating with the government - who writes the laws.
I was curious about the malfeasance, but if this is the claim of criminality... uh, not a good look for the folks yelling at Amazon.
I think MUCH stronger claims may be possible around just their fake product volume and consumer harm there, but good luck with these claims - are they being litigated anywhere?
We've gone with AWS because they're supposed to have good customer support, but opaque cost reporting and the inability to impose spending limits are a concern.
Does anyone know how azure/gcp:
1. Handle cost reporting
2. Handle spending limits (e.g. can I impose a hard spending limit per service/per user/globally?)
1. Azure basically has cost reporting comparable to Amazon's, plus a cost aggregator if you want to use both Azure and AWS. I personally thought it didn't bring all the nice features of AWS billing into Azure very well, so I'd not recommend it if your AWS usage is large and varied. I found GCP to have fewer billing features than either Azure or AWS.
2. None of the 3 providers has a hard spending limit feature, though Google App Engine (the service, not GCP generally) lets you shut an app down. Other than that, permission roles are generally the same; AWS wins slightly on features again, but Azure has a slightly nicer UI.
Anyways, you should do your own research on what cloud seems sane to you, and not let randos on HN make your business decisions.
This is absolutely terrible: if you have a spike in legitimate traffic and try to increase the limit, you lose all that “front page of HN” traffic forever because your supposedly scalable system didn't scale.
I have no idea if they have fixed that issue yet but I doubt it.
2) No cloud has proper spending limits (aside from barebone compute like linode). It works on your bill continuing to get pricier as their overall multitenant costs come down.
* Data transfer will bite you in the ass if you let it, especially over NAT gateway on very high traffic sites. So you do the right thing and put your application in private subnets, route traffic in over the load balancers and out over the NAT gateway. Then you get a $20k/mo bill for your microserviced application that does hundreds of requests per second during peak hours. Pro-tip: the poo-pooed NAT instances are actually a cheaper solution, but you're on the hook for maintaining them.
* The CUR can get huge. I mean millions and millions of lines. AWS says you can throw it into S3, query with Athena, etc. etc. But if that data set is huge even _that_ will cost you a lot of money to run reporting, analysis, etc. Especially after you build that dashboard for the refresh happy VP.
* The Cost Explorer is admittedly getting better, but still lacking a lot of necessary detail. You have to pair it up with CloudWatch to get actual cost and usage in a usable way. The value add services like EMR/Elasticsearch service/all the ML stuff do the hideous job of hiding actual usage. You gotta dig hard.
* The third party cost tracking tools (CloudHealth/Metricly/CloudAbility/Cloudyn) are just a wrapper around what you can get out of the CUR. Their value-add is reporting and advisement, and giving recommendations on right sizing and reserved instances and savings plans. Though if your cloud team is sufficiently savvy they can do this themselves.
* No matter how you do your analysis, tagging will make your life so much easier. Can't emphasize this enough.
I work for one of the vendors you mentioned; and while you're not wrong - the data sources are all from the vendors - there's a fair bit of work that goes into actually making sense of it to the point you can give it to people actually causing the spend. Also there's work that goes into optimisation, so that we can bear the cost of your second point.
Your last point is dead on though. For anyone doing cloud at any scale, tagging is non-optional if you want to do any kind of optimisation, chargeback or the like.
Do you have any recommended resources for reading or tips on how and what to tag in what way for making ones AWS Life easier?
The entire FinOps foundation is good for cloud finance management - I believe the author of that post has an O'Reilly book coming out this month on the subject also.
For tags, you can make any tag you want and summarize bills by tag... So anything that can be tagged is trackable. But some things, like bandwidth, are not.
It's also hard to enforce tagging when you can't automatically destroy non-compliant objects, so again, separate accounts help here. If a sub-department wants to know its spend better, THEY are more likely to enforce the rule than a top-down policy from a disconnected IT group... And you can't simply apply a blanket "all things must be tagged" rule enforced at the AWS level, because some items can't be tagged, or the tagging has to happen after creation (for instance, via the SDK/CLI, you can't create an EC2 instance with tags: you make the instance, then tag it. The GUI does this behind the scenes so it looks like one step).
So again, for major booking boundaries, use different accounts. After that point, it's on the delegated entities to use tags appropriately... And it's often different for each group anyway.
You can automatically destroy non-compliant (with your tagging policy) objects, by querying objects that exist and examining their tags through the API (heck, you could even script the CLI to do this), and, if you use AWS Organizations, you can prevent noncompliant resources with a combination of service control policies (to require tagging) and tag policies (to specify use of tags).
> (for instance, by SDK/cli, you can't create an ec2 instance with tags.. you make the instance, then tag it.
That's... not true. The RunInstances call in the SDK that creates one or more instances from an AMI takes an optional set of tag specifications for tags to be applied to the instances and/or a wide variety of associated resources.
The accepted approach is warn then terminate. Give them an hour and then if nothing's done start the slaughter.
Parses the detailed billing logs into elastic search. Raw, but a good starting point..
Unfortunately EstimatedCharges only updates once a day, and sometimes the Max updates before the Min, triggering a false alarm. Obviously we could make it more reliable by using a 48-hour period, but then we'd only find out something had gone haywire after it had been going on for 2 days.
Really, how much would it cost them to run the cron job for EstimatedCharges once an hour? An even shorter interval would be better (you can spend a lot on AWS in an hour).
It also stops working if you have a credit (until it runs out) so good luck if something goes wrong during that period.
We even asked a consultant (recommended by Amazon) if there was a more fine grained method, and they thought the only way would be using a third party service which ingests all the events and does its own estimate. This is nuts! If your charging is as fine grained as AWS is, so should your reporting be.
I spent quite some time in their Cost Explorer, but I don't understand a lot of it. Most days I have some positive costs, which is probably my usage. Some days I have negative costs; that's probably when they transfer credits from the voucher. They do this at the end of every month, but also irregularly on days when I use particular services and/or have higher than usual usage.
It appears to me that the negative balances are the sum of several days' costs and the credit from the voucher. As a simplified example, I might see +1, +1, +1, -6. So I "reverse engineer" this as: I spent 1, 1, 1, 3, and on the 4th day they credited 6 from my voucher. Too bad the 3 is not visible; I need to dig it out myself. In reality I use several services and they seem to credit them on different days, so the reverse engineering is not really possible, at least not without major effort.
I remember many years ago it was possible to download hourly (probably also daily) usage reports. I.e. is usage in hours, KB, requests etc. not in money. I don't find them at all anymore. Anybody knows whether they still exist?
Also, to my surprise, I was billed for 135 SQS requests last month. Well, I wasn't actually billed, because 1,000,000 are free. But my point is that I don't even know what SQS is, and I am sure I haven't used it directly. It appears to me that they are "billing" me for their implementation details, because they might use those SQS queues internally. Is that how it works? If barely using the services at all causes 135 requests, how many would that be if I really ran some production there?
All in all, very opaque. Thank you AWS for the voucher, but I am not impressed about the billing transparency.
The reason is that I have a library that reads a file iteratively into an x-size buffer using bufio in Go, and I'm not exactly sure what optimizations are happening that I can't see; at some points I'm advancing through the file a byte at a time, and that's by definition an I/O op, I think (super inefficient). Unfortunately, a lot of the cloud metrics don't give you enough granularity or quick enough feedback to optimize.
Although the Cost & Usage Report alone can solve many billing mysteries, in some cases it's also necessary to go to CloudTrail logs to determine exactly which user or application incurred charges.
When it comes to web-serving, S3 is great for really bursty loads or lots of tiny ones.
When you have a lot of throughput all the time; S3 will be an expensive choice.
Our tool is free for any devs spending less than 60K a year on AWS. Let me know if you wanna test it out!
Edit: Have never tried it, but this open source "Ice" tool looks like it might be useful for smaller shops: https://github.com/Teevity/ice
But you might want to talk to people like @QuinnyPig from the Duckbill Group before you assume the fix to your AWS issues is a third party vendor's product.
Compromised account or server? It'd be interesting for their spam filters to catch most of it. But an accidental loop or issue in your code? (like another commenter mentioned with a $30k bill). Yikes.
Incredible to lack such a basic feature to better protect an account especially when money is involved.
Options seem limited from providers in general since Azure and GCP haven't done much better in this regard - GCP cloud billing in particular felt less far along than the other two providers.
People still treat their databases like mainframes; just make the machine bigger to get better performance, and use more reliable components to decrease the risk of data loss or downtime. Amazon is happy to take your money to manage that for you. Their profit margin exists in you being scared of what can go wrong, and the ease of getting things set up.
(I don't have a good suggestion for an alternative to this approach. The design of commonly-used relational databases surprises me. They all assume disks have some sort of intrinsic durability. Disk manufacturers all make you pay extra money to maintain this illusion; underprovisioning, wear leveling, background garbage collection, hardware RAID. But at the end of the day, it can all just get sucked up by a tornado (or a rogue shell script) and all that means nothing. I do not understand why disks are not just dumb blocks of NAND flash connected directly to the application, which can then provide cross-datacenter redundancy and save everyone a ton of money. I guess that is why Google makes their own SSDs and their own planet-scale database engine. They know it's silly. The rest of us are stuck with expensive garbage that is, with 100% certainty, going to fail. Who to blame is all that we can work around, and blaming Amazon is better than blaming yourself!)
Me too! HA is fiendishly difficult and often dramatically underestimated, until things go sideways and fail in unanticipated ways, of course :)
Doesn't it? EBS stuff (storage and IOPS) being a separate line item last I checked. Ephemeral storage (if applicable) is included with the compute price.
Also, is it just me or is RDS ridiculously expensive? I'm fairly new to AWS, having mostly used (and continue to use) Azure.
I seem to remember it being around what an EC2 instance cost until you go to multi-AZ and then you're paying for an extra instance. But I've only used RDS with postgres and mysql type engines, none of the proprietary stuff that would add on extra licensing fees.
For the m4 series, RDS is almost double the instance cost.
The issue with AWS is that it's easy to add new services without finding new vendors so companies just spend more and more on AWS as features are not as important as 'cost savings'.
Whereas Azure disk costs for VMs accrue on a daily basis.
RDS is a managed database "cluster" (in the PG sense of the word) running on EC2 infra. You need to define both compute and storage. Backups are snapshots, not point-in-time. Replicas are via the standard DB engine replication.
I started on free.fr, my parents' ISP, with a web page built in iWeb. But then I wanted a .com domain.
I moved to Wordpress, but that didn't let me customise the layout.
Then I moved to Google App Engine (appspot), which is good, but it's blocked in China.
Then I moved to OpenShift (rhcloud), which was great while it lasted. It wasn't just for hosting a static web page - I could SSH into the server, at last! But it shut down the free tier.
I tried Heroku, but I'm getting warnings about 80% of monthly usage even when there's no content there, just a redirecting page.
Currently I'm using Github Pages, but I worry about how Microsoft will try monetising that. It's also a pity not to have SSH, FTP, or SQL - all my apps (e.g. Pingtype) have to run on the client in localstorage.
The company I'm working for spends over $3000 per month on AWS.
Whenever I read about these "free giveaway" AWS coupons that require registering with a credit card, I just think that there's going to be a nasty fee like this. So in practice I just run things locally on my laptop. If there's a better provider, please tell me though! It's been a few years since I last checked the options, and moved everything over to Github.
If you don't get a lot of traffic, I'd go for an Amazon LightSail instance. $5/month includes 1 TB of data transfer. If you really don't like AWS, Digital Ocean has similar offerings.
Does AWS allow you to pay with pre-paid Visa/MC cards? Could use one of those to pay for the account to avoid a surprise bill from draining your bank account.
These are sites using Hugo, static HTML, and simple build processes (TypeScript compiling).
Disclaimer: I have not personally been involved using lambdas for anything serious, so my experience is limited.