"With range requests, the client can request to retrieve a part of a file, but not the entire file. ... Due to the way AWS calculates egress costs the transfer of the entire file is billed." WTF if true.
This must be a regression bug in AWS's internal system. At a past job (2020) we used S3 to store a large amount of genomic data, and a web application made range requests to visualize tiny segments of the genetic sequence in relevant genes - like 5kb out of 50GB. If AWS had billed the cost of an entire genome/exome every time we did that, we would have noticed. I monitored costs pretty closely; S3 was never a problem compared to EC2.
It also seemed like the root cause was an interrupted range request (although I wasn't fully clear on that). Even so, that seems like a recent regression. It took me ages to get that stupid app working, and I interrupted a lot of range requests :)
You are right, this is about canceling range requests and still getting billed, not about requesting ranges and getting billed for the complete file egress. Sorry; we'll make the post clearer.
Yes, it was client-side JavaScript making the range requests, asking for a string of genomic data to render in the browser. It was only to give the scientists a pretty picture :) The EC2 costs were largely ElasticSearch for a different function, which never looked at the data in S3.
That sounds egregious enough that I have trouble believing this can be correct. My understanding is that AWS bills egress for every service; parts of the file that aren't transferred aren't part of that, so they can't be billed. There could certainly be S3-specific charges that affect cases like this, no idea. But if AWS bills the full egress traffic costs for a range request, I'd consider that essentially fraud.
Sorry, I think that part of our write-up is misleading (I was involved in analyzing the issue described here). To the best of our understanding, what happens is the following:
- A client sends range requests and cancels them quickly.
- The full requested range gets billed (NOT the whole file), even if it never gets transferred. The explanation we received is that this is due to some internal buffering S3 does, and they count that buffered data as egress.
In any case, if you send and cancel such requests quickly (which is easy enough; this was not even an adversarial situation, just a bug in some client API code), the billed egress is many times higher than your theoretical bandwidth allows (and about 80x higher than what the AWS documentation suggests, hence the blog post).
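For illustration, a rough sketch of the kind of client bug we hit: a loop that opens a large range request and drops it almost immediately. The URL and range here are made up, and `requests` stands in for whatever HTTP client the real code used.

```python
import requests

# Hypothetical object URL; the real bug was ordinary client API code.
URL = "https://example-bucket.s3.amazonaws.com/big-object.bin"

for _ in range(100):
    # stream=True: headers arrive, but the body is fetched lazily.
    r = requests.get(URL, headers={"Range": "bytes=0-1073741823"}, stream=True)
    next(r.iter_content(chunk_size=64 * 1024))  # read one 64 KiB chunk...
    r.close()  # ...and hang up. Almost nothing crossed the wire,
               # yet the full ~1 GiB range showed up as billed egress.
```

At dozens of such requests per second, the "egress" line item races far past what the pipe could physically carry.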
This is a problem with lots of services. Blocking large quantities of legitimate-looking requests is a hard problem. Request cancellation is also tricky and not well supported in a lot of frameworks/programming languages.
But the thing you are supposedly getting billed for is data transferred to the Internet. If the connection is closed, there might be some data that goes out before the reset is received, but not that much. So either this is a bug, or the pricing documentation is ... incorrect about what they actually charge you for.
Sadly this is no bug. Customers are NOT billed for actual data transferred. Customers are billed for "some kind of" data requested. If you interrupt the data transfer before downloading all requested data then it's "your fault".
AWS documents this detail somewhere hidden on the S3 pricing page (https://aws.amazon.com/s3/pricing/). Search for "Data Transfer Out may be different from the data received" in the "Data transfer" tab.
It's like if the gas company billed you for more gas than your pipe can physically carry. No amount of fine print can change physics. If they bill for some opaque internal metric, they need to change the wording in all documentation to reflect this: "we will bill you for whatever we please, you can't audit it, and we won't explain how it works".
That explanation still makes it sound like you are charged for what is transferred to the Internet, but your application might not get some of it because it already closed the connection.
From the OP it sounds like the amount billed is too high to be explained by that.
If what you are billed for isn't actually data transferred, then it is deceptive to say that you are billed for "data transferred out".
Wait. What if the range is bigger than the file? Say I request a 5 TB range but the file is actually only 1kb. Would I get charged for 5 TB of egress?
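HTTP range semantics, at least, clamp an over-long range to the object's actual size (a start offset past the end gets a 416 instead), so only ~1kb would ever be sent; whether the billing meter follows the clamped range or the requested one is exactly the open question. A quick check, against a hypothetical URL:

```python
import requests

URL = "https://example-bucket.s3.amazonaws.com/tiny-1kb-object.bin"  # hypothetical

# Ask for ~5 TB from a 1 KB object. Per HTTP range semantics the server
# clamps the range: expect 206 with "Content-Range: bytes 0-1023/1024".
r = requests.get(URL, headers={"Range": "bytes=0-5497558138879"})
print(r.status_code, r.headers.get("Content-Range"), len(r.content))
```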
AWS user believes that testing on a 1Gbps connection for 45 min can't be more than $10 of egress.
Gets a $500 bill instead.
Note: This user specified a lower bound but no upper bound on the range request (and closed the connection prematurely). Essentially read() with an offset, for a ZIP tool.
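That pattern - a lower bound, no upper bound, close when you have what you need - looks roughly like this. URL and offsets are hypothetical; it mirrors how ZIP tools seek around a remote archive.

```python
import requests

URL = "https://example-bucket.s3.amazonaws.com/archive.zip"  # hypothetical

def read_at(offset: int, size: int) -> bytes:
    """read() with an offset: open-ended range, hang up after `size` bytes."""
    r = requests.get(URL, headers={"Range": f"bytes={offset}-"}, stream=True)
    data = r.raw.read(size)
    r.close()  # connection dropped long before the open-ended stream ends
    return data

# e.g. peek at a central-directory record deep inside a large archive
header = read_at(offset=5_000_000_000, size=64 * 1024)
```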
There is probably a small area where it's difficult to measure, so I would not expect billing to be exact to the byte here. But billing for the requested range when the entire range was not actually transferred is just not correct and not acceptable.
The part where I think there is some flexibility is the difference between "bytes attempted to transfer" and "bytes actually transferred". I think it is pretty fair to bill for the former, as long as requests are aborted in a reasonable way. So I don't expect billing to be exact to the transferred byte, but I do expect it not to exceed that by more than whatever the internal transfer chunk size is.
That’s an orthogonal issue. There’s no interpretation of “egress” that means “stuff we do internally before leaving aws data centers”. If the tcp conn is reset only a few MB would leave aws frontend servers. Instead, it appears they’ve been basing the number off the range in the request and/or whatever internal caching/loading they’re doing within S3, which again has nothing to do with egress.
I mean, we already know egress is short for egregious. It’s an incredibly bad look to be overestimating the “fuck you” part of the bill.
More or less. The article quotes AWS as saying the following:
> Amazon S3 attempts to stop the streaming of data, but it does not happen instantaneously.
...which doesn't really explain it. It shouldn't send more than a TCP window after the connection is closed, and TCP windows are at most 1 GiB [1], usually much less, so this completely fails to explain the article's observed 3 TB sent vs 130 TB billed.
The article goes on to say:
> Okay, this is half the explanation. AWS customers are not billed for the data actually transferred to the Internet but instead for some amount of data that is cached internally.
In other words, how much they bill really isn't bounded by how much is sent at all. This is unacceptable.
Obviously they break the problem down into questions like "what's the maximum that could be transferred between any two components? what's the most expensive components this transits?" and pick the most expensive answers without any attempt at coherence.
> this completely fails to explain the article's observed 3 TB sent vs 130 TB billed
I interpreted that to mean their code was doing this over and over again, so in total they retrieved 3TB across a whole set of requests. Still horrifying, but mildly more explainable.
This can be explained if it's not egress out of AWS, but egress out of the S3 system itself.
S3 is block storage under the hood, so retrieving an object from such a high-availability, high-performance service means it pulls some block X of data and caches it before sending it through the socket.
That block of data comes out of internal S3 storage; it's just not sent through the bigger Internet egress subsystem.
So technically AWS may argue this is egress for S3, just not for AWS.
The consequence of your interpretation is that surprise billing is not even bounded by physical limitations. This is absolutely not reasonable. I am the user in the ZIP case. I knew this testing would incur a cost, but I could never have imagined it would be 20x my network speed, and I have absolutely no control over what AWS claims to meter.
If they charged for the extents that are actually read from S3 (even if the network connection was just closed), that'd be fine. But charging for terabytes when the actual connection consumed a few kilobytes is still wrong; surely they didn't e.g. buffer those terabytes of data from S3 disk storage into memory.
S3 is a complex system; with subsequent requests you could be hitting a different node where this cache entry does not exist yet.
And if you think egress is expensive, well, storing data in RAM for cache purposes is 1000000x more expensive.
A lot of stuff could be happening. The main problem is that AWS (I think) is charging for egress out of the S3 system, but customers are looking at their ingress on the client side, and there is a mismatch.
The way billing is calculated should be clearly labeled along with the pricing. Azure does this too; it's super unclear what metric they're using to determine what will be billed for requests. We're having to find out via trial and error. If we request 0-2GB on a 6GB file but the client cancels after 400MB, are we paying for 2GB, 400MB, or 6GB?
Is there a billed difference between Range: 0-, no "Range" header, and Range: 0-1GB if the client downloads 400MB in each scenario?
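To make those three scenarios concrete, here's the trial-and-error probe we're effectively reduced to. The Azure blob URL is hypothetical; only the Range header varies, and the client always bails after 400MB.

```python
import requests

URL = "https://example.blob.core.windows.net/container/6gb-file.bin"  # hypothetical

scenarios = {
    "open-ended range":  {"Range": "bytes=0-"},
    "no Range header":   {},
    "bounded range":     {"Range": "bytes=0-1073741823"},  # first 1 GB
}

for name, headers in scenarios.items():
    r = requests.get(URL, headers=headers, stream=True)
    _ = r.raw.read(400 * 1024 * 1024)  # client consumes 400 MB...
    r.close()                          # ...then cancels
    # Then wait for the invoice: is each scenario billed 400 MB, 1 GB, or 6 GB?
```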
Sorry for not having this made clearer (we'll fix this part of the post): the gotcha is not that AWS does not honor range requests, it's that canceling those will still add the full range of bytes to your egress bill (and this can add up quickly) although no bytes (or much fewer) have been transferred.
On the other hand, you did ask for those bytes, so what does "canceling" even mean? Just playing devil's advocate: they likely did start fetching the data for you, and that takes resources. Otherwise they would be open to a DoS attack that initiates many requests and then cancels them.
Sure, that's true. The thing is: this was the same requested (and cancelled) range on the same file(s), over and over (it was a bug). Looking at this from the outside, even some internal S3 caching should have had many cache hits and not have to re-download the requested ranges internally all the time (there were dozens of identical requests per second, immediately being cancelled).
On top of this, S3 already bills (separately) for any request against a bucket (see the other current issue with the invalid PUT requests against a secured bucket, which still got billed to the bucket owner; https://news.ycombinator.com/item?id=40203126). So I'd say both the requests and the cancellations were already paid for; the surprise was the 'egress' cost on top, of data that was not actually leaving the AWS network.
Still, you are right that this still consumes some additional AWS resources, and it is probably a non-trivial issue to fix in the 'billing system'.
Azure is probably the most egregious example of this, AWS and GCP can at least claim they have architectural barriers to implementing a hard spending cap, but Azure already has one and arbitrarily only allows certain subscription types to use it. If you have a student account then you get a certain amount of credit each month and if you over-spend it then most services are automatically suspended until the next month, unless you explicitly opt-out of the spending limit and commit to paying the excess out of pocket. However if you have a standard account you're not allowed to set a spending limit for, uh... reasons.
That's insane as well. They already built the system, but you just can't use it, because "we want the option for you to screw up and pad our billing." There are many projects I've worked on where a service not being available until the 1st of the next month would be nothing more than a minor annoyance, and I would much rather that happen than get an unexpected bill. This is also something that I think would be a nice CYA tool when developing something in the cloud for the first time. It's easy to make a mistake when learning cloud services that could be expensive, as TFA shows.
"Thank you to everyone who brought this article to our attention. We agree that customers should not have to pay for unauthorized requests that they did not initiate. We’ll have more to share on exactly how we’ll help prevent these charges shortly." — Jeff Barr, Chief Evangelist, Amazon Web Services
AWS APIs need a cleanup. I am constantly running into issues not documented in the official docs, the boto3 docs, or even on StackOverflow. It's not even funny when a whole day goes by trying to figure out why I see nothing in the body of a 200 OK response when I request data that I know is there in the bowels of AWS. Then it turns out that one param doesn't allow values below a certain number, even though the docs say otherwise.
Historically, they've been scared of versioning their APIs (not many services have done it; DynamoDB has, for example).
It leads to a "bad customer experience", having to update lots of code, and also increases maintenance costs while you keep two separate code paths functional.
There's a lot about the S3 API that would be changed, including the response codes etc., if S3 engineers had freedom to change it! I remember many conversations on the topic when I worked alongside them in AWS.
It’s quite insane the levels of effort S3 engineers put in to maintain perfect API compatibility. Even tiny details such as whitespace or ordering have messed up project timelines and blocked important launches.
That could be the same root cause. You download data via a range request with no upper bound, and AWS bills you for far more than you actually downloaded.
I can assure you this was not AI-generated, apart from the 'symbolic image' (which should be fairly obvious :).
Maybe that's just our non-native English shining through. In any case, as a small European company in the healthcare space, we are quite used to having to explain "the cloud" (with all potential and pitfalls) to our customers. They are also (part of) the target audience for this post, hence the additional explanations.
(Not OP and not author of the article, but was involved in the write-up.)
If writers don't want people to think their content is AI generated, maybe they shouldn't put ugly AI generated images on top of everything they write.
Sounds like they were using the Range header on large files. I have made systems in the past using exactly this pattern (without the intentionally dropped requests).
I hope this doesn't result in any significant changes as I really liked using this pattern for sequential data processing of potentially large blobs.
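For reference, the pattern in question, sketched with boto3 (bucket and key names are placeholders). Each range is bounded and fully consumed, so nothing here should trip the cancellation gotcha:

```python
import boto3

s3 = boto3.client("s3")
BUCKET, KEY = "example-bucket", "large-blob.bin"  # placeholders
CHUNK = 8 * 1024 * 1024  # 8 MiB per range request

def process(chunk: bytes) -> None:
    ...  # application-specific handling (placeholder)

size = s3.head_object(Bucket=BUCKET, Key=KEY)["ContentLength"]
for offset in range(0, size, CHUNK):
    end = min(offset + CHUNK, size) - 1
    resp = s3.get_object(Bucket=BUCKET, Key=KEY, Range=f"bytes={offset}-{end}")
    process(resp["Body"].read())  # each bounded range is read to completion
```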
Early Athena (managed PrestoDB by AWS) had a similar bug when measuring columnar file scans. If it touched the file, it counted the whole file instead of just the column chunks read. If I'm not mistaken, this was a bug in Presto itself, but a simple patch had landed upstream long before we ran our tests. This was the first and only time we considered using a relatively early AWS product. It was so bad that our half-assed, self-deployed version outperformed Athena on every metric we cared about.
Jeff Barr posted that AWS is actively working on a resolution for this: https://twitter.com/jeffbarr/status/1785386554372042890 . Given who he is, I take this as a strong indication that there will be a reasonable fix in the near future.
Lots of reasons. My company started using AWS (and specifically S3) something like 9 years ago; R2 wasn't even on the radar back then. If I were starting from scratch today, I'd be looking seriously at Cloudflare as a platform, but it's only in the last year or two that they've offered these services that would make it possible to build substantial applications.
R2 also doesn't have all the features that S3 does - including an equivalent of S3 Glacier, which is cheaper storage than R2. R2 also doesn't have object tagging, object-level permissions, or object locking. Sure, you could build your own layer in front of R2 that gives you these features, but are you necessarily saving money over just using S3?
inertia, mostly. If you're already using AWS and have your system set up, adding a new vendor is going to be a lot of extra work compared to just using what's available.
CloudFront charges for all requests, so a 404 Not Found will be counted towards the total number of requests, and you will be billed accordingly. Hopefully somebody will prove me wrong :-)
"Denial of Wallet" seems a misnomer--it makes it sound like source of payment is being blocked. They should really use the same term cellular systems have been for decades to describe this kind of threat, namely an "overbilling attack".
We use CloudFront and we deny public users the ability to access S3 directly. You can even use Signed URLs with CloudFront if you like. I'm not sure I'd ever feel comfortable letting the public at large hit my S3 endpoints.
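The signed-URL part looks roughly like this with botocore's CloudFrontSigner; the key-pair ID, key path, and domain below are all placeholders:

```python
from datetime import datetime, timedelta

import rsa  # pip install rsa
from botocore.signers import CloudFrontSigner

def rsa_signer(message: bytes) -> bytes:
    # Private half of the key pair registered with CloudFront (placeholder path).
    with open("cloudfront_private_key.pem", "rb") as f:
        key = rsa.PrivateKey.load_pkcs1(f.read())
    return rsa.sign(message, key, "SHA-1")  # CloudFront expects SHA-1 RSA sigs

signer = CloudFrontSigner("KXXXXXXXXXXXXX", rsa_signer)  # placeholder key-pair ID
url = signer.generate_presigned_url(
    "https://dxxxxxxxxxxxxx.cloudfront.net/private/object.bin",  # placeholder
    date_less_than=datetime.utcnow() + timedelta(hours=1),
)
# Clients get `url`; the bucket itself stays reachable only via CloudFront.
```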
As it should be, but recently on HN it was posted that AWS will charge you for any unauthorized PUT request to your S3 buckets. Meaning even 4xx errors will rack up a charge.
So your S3 bucket names must be hidden passphrases now that stand between an attacker and your budget.
Also, the cost of doing this per request is insane compared to either absorbing or rate-limiting the bandwidth the requests take.
Cloud computing charges you by the request/byte/cpu cycle. Servers do not have this issue.
Also, is it simply not possible to rate limit this on a per IP basis? Make clients only able to do X requests per second from each unique IP/network flow.
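The mechanics of that are simple enough; here's a minimal per-IP token-bucket sketch (the limits are arbitrary):

```python
import time
from collections import defaultdict

RATE, BURST = 10.0, 20.0  # arbitrary: 10 req/s sustained, bursts up to 20

class Bucket:
    def __init__(self) -> None:
        self.tokens = BURST
        self.last = time.monotonic()

buckets: defaultdict[str, Bucket] = defaultdict(Bucket)

def allow(ip: str) -> bool:
    """Admit the request if this IP's bucket still holds a token."""
    b = buckets[ip]
    now = time.monotonic()
    b.tokens = min(BURST, b.tokens + (now - b.last) * RATE)  # refill over time
    b.last = now
    if b.tokens >= 1.0:
        b.tokens -= 1.0
        return True
    return False
```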
>Cloud computing charges you by the request/byte/cpu cycle. Servers do not have this issue.
Sure they do. Processing requests takes bandwidth, CPU, memory, and disk I/O.
>Also, is it simply not possible to rate limit this on a per IP basis
It's largely useless. You'll block legitimate bots/programs, people on CGNAT, and people on corporate networks, while bad actors will use botnets, residential IPs, and VPNs to gain access to thousands or millions of unique IPs.