From S3 to R2: An economic opportunity (dansdatathoughts.substack.com)
274 points by dangoldin on Nov 2, 2023 | 173 comments



Cloudflare has been attacking the S3 egress problem by creating Sippy: https://developers.cloudflare.com/r2/data-migration/sippy/

It allows you to incrementally migrate off of providers like S3 and onto the egress-free Cloudflare R2. Very clever idea.

He calls R2 an undiscovered gem and IMO this is the gem's undiscovered gem. (Understandable since Sippy is very new and still in beta)


Author here - cool link to Sippy. I love the idea since you're migrating data as needed, so the cost you incur is a function of the workload. It's basically acting as a caching layer.


CacheReserve is a neat way to relegate S3 and front it with R2 instead: https://developers.cloudflare.com/cache/advanced-configurati...

And, if you're in the same boat as someone down-thread complaining about Cloudflare's uptime in recent weeks, you can keep S3 + Cloudfront (or Lightsail Buckets + Lightsail CDN) and S3 + CacheReserve on R2, which is what we do, and flip between them with DNS.


What are the economics behind Amazon and other providers charging egress fees while R2 doesn't? Is it acting as a loss leader, or does this model still make money for Cloudflare?


Amazon doesn't have a per-unit cost for egress. They charge you for the stuff you put through their pipe, while paying their transit providers only for the size of the pipe (or more often, not paying them anything since they just peer directly with them at an exchange point).

Amazon uses $/GB as a price gouging mechanism and also a QoS constraint. Every bit you send through their pipe is basically printing money for them, but they don't want to give you a reserved fraction of the pipe because then other people can't push their bits through that fraction. So they get the most efficient utilization by charging for the stuff you send through it, ripping everybody off equally.

Also, this way it's not cost effective to build a competitor to Amazon (or any bandwidth intensive business like a CDN or VPN) on top of Amazon itself. You fundamentally need to charge more by adding a layer of virtualization, which means "PaaS" companies built on Amazon are never a threat to AWS and actually symbiotically grow the revenue of the ecosystem by passing the price gouging onto their own customers.


AWS egress charges blatantly take advantage of people who have never bought transit or done peering.

To them "that's just what bandwidth costs" but anyone who's worked with this stuff (sounds like you and I both) can do the quick math and see what kind of money printing machine this scheme is.


It's also a way to choose your customers.

Some people want to host a lot of warez and pirate movies and stuff but that doesn't monetize very well per GB consumed so pricing bandwidth high means those people never show up, thus saving a lot of trouble for AWS.

I remember when salesforce.com announced a service that would let you serve up web pages out of their database, it was priced crazy high (100-1000x too much) from the viewpoint of "I want to run a blog on this service" but for someone who wanted to put in a form to collect data from customers it was totally affordable. Salesforce knew which customers it wanted and priced accordingly.


You don't get charged for transit if you are sending stuff IN from the internet or to any other AWS resource in that region. So there is no QoS constraint inside, except for perhaps paying the S3 GET/SELECT/LIST costs.

It is pretty much exclusively to lock you into their services. It heavily impacts multi-cloud and outside-of-AWS service decisions when your data lives in AWS and is taxed at 5-9 cents a GB to come out. We have settled for inferior AWS solutions at times because the cost of moving things out is prohibitive (e.g. AWS Backup vs other providers).


It also makes things like just using RDS for your managed database and having compute nearby but with another provider often incredibly expensive.


Author here - have you tried using R2? As others mentioned there's also Sippy (https://developers.cloudflare.com/r2/data-migration/sippy/) which makes this easy to try.


Honest question: how is this different from a toll road? An entity creates a road network with a certain size (lanes, capacity/hour, literal traffic) and pays for it by charging the individual cars put through the road.


The toll road is the only way out of the county and it's charging $90. That's what's different.


There are at least a couple of reasons that your analogy doesn't really work.

First, a lot of these roads are 'free' and yet you're still being charged for them. If two large networks come to an agreement, they connect the two networks (i.e. build that road), but no money changes hands.

Second, if there were a paid peering agreement in place (i.e. say AWS had a cost to push your data out), that still wouldn't be billed to them in the way they're charging you. Instead they'd be paying based on the rate of traffic at something like the 95th percentile of peak usage. This means that you could download a petabyte of data from them when the pipe isn't busy and cost them nothing, or you could download a gigabyte when it's busy and push up the costs.
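For anyone who hasn't seen 95th-percentile ("burstable") billing before, here's a minimal sketch of how it works: utilization is sampled every 5 minutes, the top 5% of samples are discarded, and the bill is based on the highest remaining sample. The sample data and price below are made up for illustration:

    # Minimal sketch of 95th-percentile transit billing (illustrative numbers).
    def ninety_fifth_percentile_bill(samples_mbps, price_per_mbps):
        ordered = sorted(samples_mbps)
        cutoff = int(len(ordered) * 0.95)   # index just past the 95th percentile
        billable_mbps = ordered[cutoff - 1]
        return billable_mbps, billable_mbps * price_per_mbps

    # A 30-day month has ~8640 five-minute samples. Suppose the link idles at
    # 100 Mbps but bursts to 5000 Mbps for ~4% of samples: the bursts are free,
    # because they all fall inside the discarded top 5%.
    samples = [100] * 8300 + [5000] * 340
    mbps, bill = ninety_fifth_percentile_bill(samples, price_per_mbps=1.00)
    print(mbps, bill)   # 100 Mbps billable -> $100, despite multi-terabyte bursts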


The difference is that Amazon doesn’t own the road, they’re just a truck driver. Amazon customers rent space on the truck and pay whatever the driver asks them for.


Yeah ok but you still need a ticket to board the truck/bus, right? The more people in your family, the more tickets.

The issue isn’t charging for egress, but charging excessively.


Or, really, like any capital-intensive business that makes its money by charging based on usage rather than on total capacity.


You pay for the capacity of your network.

Cloudflare has huge ingress, because they need it to protect sites against DDOS.

They basically already pay for their R2 bandwidth ( = egress) because of that.

Additionally, with their SDN (software-defined networking) they can fine-tune some of the data flow / bandwidth too.

That's how I understood it, fyi.

Some more info can be found from when they started (or co-founded, not sure) the Bandwidth Alliance.

Eg.

https://blog.cloudflare.com/aws-egregious-egress/

https://blog.cloudflare.com/bandwidth-alliance/


somebody more knowledgeable please correct me if i'm mistaken, but i think the bandwidth alliance is really the linchpin of the whole thing. basically get all the non-AWS players in the room and agree on zero-rating traffic between each other, to provide a credible alternative to AWS networks


Also, for the CDN case that R2 seems to be targeting - regardless of the origin of the data (R2 or S3), chances are pretty good that Cloudflare is already paying for the egress anyway.


It's actually worse than that.

In the CDN case Cloudflare has to fetch it from the origin, cache (store) it anyway, and then egress it. By charging for R2 they're turning that cost center into a profit center.


I'm not sure about that.

A CDN keeps the data nearby, reducing the need to pay egress to the big bandwidth providers.

( not an expert though)


Let's say you want to use cloudflare, or another CDN. The process is pretty simple.

You setup your website and preferably DON'T have it talk to anyone other than the CDN.

You then point your DNS to wherever the CDN tells you to. (Or let them take over DNS. Depends on the provider.)

The CDN then will fetch data from your site and cache it, as needed.

Your site is the "origin", in CDN speak.

If Cloudflare can move the origin within their network, there are huge cost savings and reliability increases there. This is game-changing stuff. Do not underestimate it.


Completely free egress is a loss leader, but the true cost is so little (at least 90x less than what AWS charges) that it pays for itself in the form of more CloudFlare marketshare/mindshare.


I know from personal experience that "big" customers can negotiate incredible discounts on egress bandwidth as well. 90-95% discount is not impossible, only "retail" customers pay the sticker price.


That's still a 3-10x markup though. And it's also very dependent on your relationship with AWS. What happens if they don't offer the discount on renewal?


Cloudflare wrote a blog post about their bandwidth egress charges in different parts of the world: https://blog.cloudflare.com/the-relative-cost-of-bandwidth-a...

The original post also includes a link to a more recent Cloudflare blog post on AWS bandwidth charges: https://blog.cloudflare.com/aws-egregious-egress/


I’m inherently suspicious of services that are free (like Cloudflare egress). Maybe I’ve been burned too many times over the years, but I almost expect some kind of hostility or u-turn in the long run (I do really like Cloudflare’s products right now!).

I almost wish they had some kind of sustainable usage-based charge that was much lower than AWS.

Feel free to tell me why I’m wrong! I’d love to jump onboard - it just seems too good to be true in the long-term.


Because they're a CDN. You pay for storage already, so an object that isn't downloaded much is paid for. An object that gets downloaded a lot uses bandwidth, but the more popular it is, the more effective the CDN caching is.

There probably needs to be an abuse prevention rate limit (and probably is), but it's not quite as crazy as it sounds to just rely on their CDN bandwidth sharing policies instead of charging.


What happens if I host an incredibly popular file, and start eating up everyone else’s share of the bandwidth? ie - I become a popular Linux distro package mirror?

I do think there are “soft limits” in place like you say - it’s just my personal preference to have documented limits (or pay fairly for what you use). IMO it helps stop abuse, and prevents billing surprises for legitimate heavy use-cases.


They undoubtedly limit the % of bandwidth you can use when the link is full. The problem with that is that it's very hard to quantify, because whether or not they have spare bandwidth for you depends a lot on location, timing, and what else is happening on the network.

But that's really no different from the guarantee you get from most CDN services. If you're using cloudflare in front of S3, for example, you'll end up with the same behavior.


> But that's really no different from the guarantee you get from most CDN services. If you're using cloudflare in front of S3, for example, you'll end up with the same behavior.

But in my mind it’s also comforting that something like Cloudfront has a long-term sustainable model (I should also add with fewer strings attached like hosting video).

I do think the prices at AWS are too high, but it discourages bad actors from filling up the shared pipes. ISPs are sometimes a classic example of what happens when a link is oversubscribed.

Cloudflare’s “soft limits” are also somewhat of a dark pattern if you ask me. I like to know exactly how much something will cost, and it’s really hard to figure out with Cloudflare if you’re a high-traffic source. Do I hit the “soft limits,” or not? It’s really hard to say with their current model.

FWIW, I think Cloudflare is a great product right now - I am just skeptical they can keep it up forever.


also, egress fees are a sort of vendor lock-in, because getting data out of the cloud is vastly more expensive than putting new data into the cloud.


The big cloud providers are Hotel California - you can check in but you can't check out.

Of course you can (like Snap) but it's a MASSIVE engineering effort and initial expense.


Exactly this. Data has gravity, and this increases the gravity around data stored at Amazon...making it more likely for you to buy more compute/services at Amazon.


very true, but data gets stale very quickly. So you start putting new data in a new place. Eventually, you don't care about the old place. And all the people and processes who accessed the data in the old place are gone.


Completely agreed about data gravity, but it's not just that, it's also customer opted-in vendor-lockin.

The customer (because they are lazy, don't know better, aren't capable, or all three) opts in to use various "convenient" CSP "services". These services may look convenient (and are always pretty to extremely expensive), and they quickly become an integral part of the customer's badly architected "system".

The end result is complete vendor-lockin, the inability of the poor (stupid) user to leave and the continued gang rape of their bank account (also via additional, incompetent developer and devops "resources").

Throw in the average modern "devops" who are hired to handle this. They aren't like the sysadmins of yesteryear; they no longer have experience with, or understand, the bits and bytes. They are glorified UI clickers and YAML editors, and they even lack any reasonable system-level debugging skills. For every problem they encounter they immediately run to Google in search of answers.

In addition, I would argue that CSPs are a huge, huge waste of computing, space and power resources, because their systems completely encourage people to just do things, without understanding what they are doing, screw the consequences and just pay.

Result, the business suffers greatly (on so many levels), the CSP wins big and continues winning.

What happens here is that a system which, if designed right from the get-go, could have run on a SINGLE, modern, high-end, well-positioned and well-connected server on the Internet, is instead replaced with tens to hundreds of "instances" and random assorted CSP-provided services -- what a colossal waste.

Books could be written on the negligence, the lack of understanding, the utter tech stupidity and, ultimately, the absurd costs.


It is a colossal waste of resources, indeed.

It's also a huge waste of human effort managing the complexity introduced by the cloud provider's arbitrary bullshit.

At this point multiple generations of engineers have little understanding of underlying layers of technology, having only really learned how to use cloud services. No TCP/IP, no UNIX, just a bit of bash and a ton of AWS.

Cloud providers do hide most of the low level complexity, which could be seen as a benefit (at least that seems to be what's touted as a main benefit, along with instant scalability.) Unfortunately they replace all of that with more arbitrary complexity which is ultimately (in my opinion, at least) a much bigger burden than the fundamental complexity that is abstracted away.


Greed on the cloud providers part, I think. You'd expect egress fees to enable cheaper compute, but there are other cloud providers out there like Hetzner with cheaper compute and egress, so the economics don't really add up.


Indeed, Hetzner is so much cheaper that if you have high S3 egress fees you can rent Hetzner boxes to sit in front of your S3 deployment as caching proxies and get a lot of extra "free" compute on top.

It's an option that's often been attractive if/when you didn't want the hassle of building out something that could provide S3 level durability yourself. But with more/cheaper S3 competitors it's becoming a significantly less attractive option.


Scaleway also, and they are fully S3 compatible. I use their glacier service for backup. I store 1.5TB for around 3€ per month.

I used the storage box from hetzner before but they only had 1TB or 5TB (and higher) choices so I had to pay for 5TB (€12 per month) without using most of it. Having rsync support was nice but rclone works fine with S3.


The way to reduce s3 egress fees is to use CloudFront, negotiate your cloudfront fees down, then use s3 as the origin.


There has to be more to it than a pure loss leader, since there's also the Bandwidth Alliance Cloudflare is in, which allows R2 competitors like Backblaze B2 to also offer free egress, which benefits those competitors while weakening the incentive for R2 somewhat.


Clever


Here’s a tweet from Corey Quinn describing how bonkers R2 pricing is:

> let’s remember that the internet is 1-to-many. If 1 million people download that 1GB this month, my cost with @cloudflare R2 this way rounds up to 13¢. With @awscloud S3 it’s $59,247.52.

https://x.com/quinnypig/status/1443076111651401731?s=46
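A rough sketch of where the S3 side of that number comes from: internet egress is billed in tiers. The tier boundaries and prices below approximate AWS's published US-region data-transfer-out rates and are illustrative, not authoritative; the exact tweet figure also depends on request fees and the pricing in effect at the time, but the ballpark is the same:

    # Tiered S3 internet egress cost for 1M downloads of a 1 GB object.
    TIERS = [                   # (tier size in GB, price per GB)
        (10 * 1024, 0.09),      # first 10 TB
        (40 * 1024, 0.085),     # next 40 TB
        (100 * 1024, 0.07),     # next 100 TB
        (float("inf"), 0.05),   # beyond 150 TB
    ]

    def s3_egress_cost(gb):
        cost = 0.0
        for tier_gb, price in TIERS:
            chunk = min(gb, tier_gb)
            cost += chunk * price
            gb -= chunk
            if gb <= 0:
                break
        return cost

    print(f"${s3_egress_cost(1_000_000):,.2f}")   # roughly $54k before request fees;
                                                  # R2 egress for the same traffic: $0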


I left AWS in 2019, so my knowledge on the current recommendations & pricing is dated. But even back then we were strongly discouraging usage like this for S3 for both security and cost reasons. Cloudfront should be in front of your bucket serving the objects, and IIRC it’ll be 75% cheaper in most cases. Still doesn’t bring it within even a couple of orders of magnitude of the R2 price, but this comparison does feel like it’s painting a best case versus a worst case. And the worst case being an approach that goes against best practice recommendations that are at least half a decade old at this point (I will concede people absolutely still do it though!).


Don’t those volume discounts kick in once you’re doing near 1PB? You’re still paying “normal” CloudFront prices all the way up and it also varies per region :$

https://aws.amazon.com/cloudfront/pricing/


There’s definitely volume discounts for CloudFront too. Or, there was. As I said my intimate knowledge here is years old.


To be fair, 1 million downloads @ 1GB is a lot of data transfer. CloudFlare is likely losing money on this.


Why would they?

Cloudflare doesn't pay for egress and neither does AWS.


But their egress capacity is limited, no? We're talking about 1PB per month here. If every one of their customers paid only 13 cents a month while pushing out 1PB per month, wouldn't they need to significantly upgrade their hardware and lose money in the process?


They are already serving roughly 20% of internet traffic as is so there is some natural limit to this whole thing.


Yup. Bandwidth at scale is effectively free.

The greatest trick AWS ever pulled was convincing the world you needed to pay for bandwidth.


Well you’ve won me over.


Just as an fyi, eastDakota is in Cloudflare’s executive team. Think he’s their CEO.

Not saying not to trust him - he’s probably a very reasonable and standup guy - but you should know this about him before taking his word on a topic like this.

No disrespect @eastDakota


"If every customer of them would be paying only 13 cents a month and pushing out 1PB per month, wouldn't they need to significantly upgrade their hardware and lose money in the process?"

Yes, but every customer is never going to do this all at once, so what is your point?

You have to think more in terms of averages for things like this.


> If every one of their customers paid only 13 cents a month while pushing out 1PB per month...

Which is never going to happen for legitimate use cases.

And Cloudflare has DDoS protection for illegitimate ones.


I'm abusing the hell out of it right now offering GB+ downloads that I used to use Digital Ocean Spaces for. It's saving me $2000-3000 a month since the switch.

Maybe abuse isn't the right word but definitely making the most.

I am a bit scared about being turned off overnight though.


Not abuse. Thanks for being a customer. Bandwidth at scale is effectively free.


Just for the sake of enlightening some people: roughly $1000 per month buys you unlimited/unmetered 10GbE (10 Gbps) connectivity to your server/rack (do you know what this is?), from a tier-1 network provider.

This translates to roughly 1.2 gigabytes per second (every second of the month), and 3240 terabytes of data per month - in or out, the choice is yours.

Things scale down as you buy more bandwidth, or commit to a longer contract.

Many would say that $1000 per month is literally "nothing" in terms of cost of service for most real businesses out there, and if you're a happy CSP user, you're probably paying a hell of a lot more than that per month for your infra.


Same here. What's your bandwidth usage? 500TB/month here for less than $4, on track to be serving petabytes in next few months. Feels so abusive.


Yes, R2 is likely losing money in this case. But network capacity and switches are not that expensive relative to the way AWS charges for them. For $60k/month or $720k/year, AWS is basically giving you about 3 Gbit/s of sustained throughput.

I feel R2 should charge something for transfer though, otherwise people could abuse it. Hetzner charges ~1.5% of AWS egress fees, which I feel is the right thing to do and likely profitable.


A one-hour 4K Netflix episode would be on the order of gigabytes and likely watched by even more than 1M people. Game downloads are even bigger, often several GB, with a similar number of users.

Though not everyone is Netflix.


Almost no one is


Netflix serves all of this data from their caches, very close to end users, paying probably nothing for said bandwidth.


Won't you have the same problem with R2, once your data is in that provider?


No, because CloudFlare doesn’t charge for egress traffic.


As an indie dev, I recommend R2 highly. No egress is the killer feature. I started using R2 earlier this year for my AI transcription service TurboScribe (https://turboscribe.ai/). Users upload audio/video files directly to R2 buckets (sometimes many large, multi-GB files), which are then transferred to a compute provider for transcription. No vendor lock-in for my compute (ingress is free/cheap pretty much everywhere) and I can easily move workloads across multiple providers. Users can even re-download their (again, potentially large) files with a simple signed R2 URL (again, no egress fees).
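For anyone curious what the "simple signed R2 URL" flow looks like: R2 speaks the S3 API, so the standard SDKs work against it by swapping the endpoint. A minimal sketch with boto3 (account ID, credentials, bucket and key are all placeholders):

    import boto3

    # R2 exposes an S3-compatible endpoint; the account ID and R2 API token
    # credentials below are placeholders.
    s3 = boto3.client(
        "s3",
        endpoint_url="https://<ACCOUNT_ID>.r2.cloudflarestorage.com",
        aws_access_key_id="<R2_ACCESS_KEY_ID>",
        aws_secret_access_key="<R2_SECRET_ACCESS_KEY>",
        region_name="auto",
    )

    # Presigned GET URL: the user downloads straight from R2, and since R2
    # egress is free the transfer itself costs nothing extra.
    url = s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": "my-uploads", "Key": "audio/session-42.mp3"},
        ExpiresIn=3600,  # link valid for one hour
    )
    print(url)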

I'm also a Backblaze B2 customer, which I also highly recommend and which has slightly different trade-offs (R2 is slightly faster in my experience, but B2 is 2-3x cheaper for storage, so I use it mostly for backups and other files that I'm likely to store a long time).


Have you looked into Workers AI? I’m actually curious to know what the cost difference would be for your compute setup vs. Workers AI.


The premise of Workers AI is really cool and I'm excited to see where it goes. It would need other features (custom code, custom models, etc) to make it worth considering for my needs, but I love that CF is building stuff like this.


Is there any reason to not use R2 over a competing storage service? I already use Cloudflare for lots of other things, and don't personally care all that much about the "Cloudflare's near-monopoly as a web intermediary is dangerous" arguments or anything like that.


1. This is the most obvious one, but S3 access control is done via IAM. For better or for worse, IAM has a lot of functionality. I can configure a specific EC2 instance to have access to a specific file in S3 without the need to deal with API keys and such. I can search CloudTrail for all the times a specific user read a certain file.

2. R2 doesn't support file versioning like S3. As I understand it, Wasabi supports it.

3. R2's storage pricing is designed for frequently accessed files. They charge a flat $0.015 per GB-month stored. This is a lot cheaper than S3 Standard pricing ($0.023 per GB-month), but more expensive than Glacier and marginally more expensive than S3 Standard - Infrequent Access. Wasabi is even cheaper at $0.0068 per GB-month but with a 1 TB billing minimum. (See the rough cost sketch after this list.)

4. If you want public access to the files in your S3 bucket using your own domain name, you can create a CNAME record with whatever DNS provider you use. With R2 you cannot use a custom domain unless the domain is set up in Cloudflare. I had to register a new domain name for this purpose since I could not switch DNS providers for something like this.

5. If you care about the geographical region your data is stored in, AWS has way more options. At a previous job I needed to control the specific US state my data was in, which is easy to do in AWS if there is an AWS Region there. In contrast R2 and Wasabi both have few options. R2 has a "Jurisdictional Restriction" feature in Beta right now to restrict data to a specific legal jurisdiction, but they only support EU right now. Not helpful if you need your data to be stored in Brazil or something.
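Re point 3, a small sketch of what those per-GB prices work out to at a couple of sizes (numbers taken from the list above; egress and per-request costs are excluded, so treat this as a sketch rather than a quote):

    # Monthly storage cost using the per-GB-month prices quoted above.
    PRICES_PER_GB_MONTH = {
        "R2": 0.015,
        "S3 Standard": 0.023,
        "Wasabi": 0.0068,   # Wasabi bills a 1 TB minimum
    }

    def monthly_cost(provider, gb_stored):
        if provider == "Wasabi":
            gb_stored = max(gb_stored, 1024)   # 1 TB billing minimum
        return gb_stored * PRICES_PER_GB_MONTH[provider]

    for gb in (100, 5 * 1024):   # 100 GB vs 5 TB
        row = ", ".join(f"{p}: ${monthly_cost(p, gb):,.2f}" for p in PRICES_PER_GB_MONTH)
        print(f"{gb} GB -> {row}")
    # 100 GB  -> R2: $1.50,  S3 Standard: $2.30,   Wasabi: $6.96 (the minimum bites)
    # 5120 GB -> R2: $76.80, S3 Standard: $117.76, Wasabi: $34.82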


Thank you for providing a product roadmap. Label all the above: coming soon.


Fantastic. Looking forward to convincing my CTO to switch to R2 when your team gets closer to finishing these!


I'm happy to be told to look harder but I couldn't find an R2 Object Lock equivalent.

I do have to wonder if that leaves R2 customers one minor compromise away from losing their whole data store.


I don't know about R2 specifically, but we migrated one of our services from S3 to Cloudflare Images, and we have been hit with over 40 hours of downtime on CF's side over the last 30 days. One of the outages was 22 hours long. Today's outage has been going for almost 12 hours and is still ongoing, and we have had 2 or 3 other >1h outages.

Every cloud provider has outages sometimes but CF has been horrendous.

We were actually planning on migrating some other parts to R2 but we are just ditching CF altogether and just going to pay a bit more on AWS for reliability.

So if R2 has been impacted even a third as much as CF images, that would definitely be an important consideration.


What third-party sites do people use to track vendor downtimes, because they don't declare it honestly themselves?

I found https://isdown.app/integrations/cloudflare/cloudflare-sites-...


I don’t know why this isn’t mentioned more. CF offerings (R2/Workers/Pages) are so unreliable that I’m wondering if anyone is actually using them.


We have been using Workers for ~12 months now with very little actual downtime. There have been some regional issues but no worldwide outages.

That said we don't use any queues, KV, etc. Just pure JS isolates so that probably contributes to the robustness.

We do use the Cache API though and have run into weirdness there. We also needed to implement our own stale-while-revalidate (SWR) because CF still refuses to implement this properly.

Overall CF is a provider that I would say we begrudgingly acknowledge as good. Stuff like the SWR thing can be really frustrating, but overall reliability and performance are much better since moving to CF.


> Overall CF is a provider that I would say we begrudgingly acknowledge as good.

I don't understand. You say that you used a very small subset of their offering in a very specific and limited way; and with that you conclude that their offering is "good"? Shouldn't you make that conclusion after reviewing at least 50% of their offering?


All of those extra features aren't their offering. Their offering is their network, everything else is just icing.


It's been a while, but last time I checked, write latency on R2 was pretty horrendous. Close to 1s compared to S3's <100ms, tested from my laptop in SF. Wouldn't be surprised if they made progress on this front, but definitely do dig deeper if your workload is sensitive to write latency.

Another (that probably contributes directly to the write latency issues) is region selection and replication. S3 just offers a ton more control here. I have a bunch of S3 buckets replicating async across regions around the world to enable fast writes everywhere (my use case can tolerate eventual consistency here). R2 still seems very light on region selection and replication options. Kinda disappointed since they're supposed to be _the_ edge company.


As far as I know, R2 offers no storage tiers. Most of my s3 usage is archival and sits in glacier. From Cloudflare's pricing page, S3 is substantially cheaper for that type of workload.


I know people archive all kinds of data. I use Glacier as off-site backup for my measly 1TB of irreplaceable data. But I know many customers put petabytes in it.

What could you have a petabyte of that you're pretty sure you'll never need again? What kind of datasets are you storing?


> pretty sure you'll never need again?

It doesn't have to be nearly that stark.

If we factor out egress, since it's the same for everything, the bulk retrieval cost for glacier deep archive is only $2.50/TB.

That means that a full year of storage ($12) plus four retrievals ($10) is roughly the same price as a single month of normal S3 storage ($23).


long term work stuff. Things we would be contractually obligated to produce many years down the line.

Plenty of other people storing images, video, etc. a PB is really not that much stuff when it's not just for personal consumption.


There is no data locality. If your workload is in AWS already you might save money by keeping the data in the more expensive S3 vs going out to Cloudflare to fetch your bytes and return your results.

If you don't mind having your bits reside elsewhere, Backblaze B2 and Bunny.net single location storage are both cheaper than Cloudflare.


Is R2 subject to Cloudflare's universal API rate limit? They have an API rate limit of 1200 requests/5 minutes that I've hit many times with their images product.

And they won't increase it unless you become an enterprise customer in which case they'll generously double it.
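If you're stuck under that global limit, the usual client-side workaround is a sliding-window throttle that never lets more than 1200 requests out in any 5-minute span. A generic sketch (the 1200/300s figure comes from the comment above; everything else is an assumption, not Cloudflare's API):

    import time
    from collections import deque

    # Client-side sliding-window throttle: at most `limit` requests per `window` seconds.
    class Throttle:
        def __init__(self, limit=1200, window=300):
            self.limit, self.window = limit, window
            self.sent = deque()   # timestamps of recent requests

        def wait(self):
            now = time.monotonic()
            while self.sent and now - self.sent[0] > self.window:
                self.sent.popleft()                              # forget expired requests
            if len(self.sent) >= self.limit:
                time.sleep(self.window - (now - self.sent[0]))   # wait for the oldest to expire
            self.sent.append(time.monotonic())

    throttle = Throttle()
    # throttle.wait()  # call before each Cloudflare API request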


There is the Images Batch API that isn't subject to the 1200 requests/5 minutes limit: https://developers.cloudflare.com/images/cloudflare-images/u...


R2 doesn't support versioning yet. If you need versioning you have to use DigitalOcean Spaces (also cheaper than S3) or S3.

Otherwise, I've been using R2 now in production for wakatime.com for almost a month now with Sippy enabled. The latency and error rates are the same as S3, with DigitalOcean having slightly higher latency and error rates.


For data that isn't frequently egressed. Log storage, for example: data locality will be better and you will only ever extract a small % of the logs stored.

Same with data that is aggregated into smaller data set within AWS before you egress it.


One major thing that R2 doesn't have is big data distributed table support. E.g. you can use BigQuery to query data on GCS, or you can use Athena on S3.


If you already use Cloudflare for lots of other things, no.

If you already use AWS for lots of other things, yes.


There are some dealbreakers to me.

1. No object history or locking. So there is absolutely no way to recover files when you make any kind of mistake.

2. No object tiering, and storage is not that cheap. Although R2 egress is free, R2 is only 35% cheaper than S3 in terms of storage, and it is not cheaper than other alternatives. Furthermore, R2 is a lot more expensive than S3's infrequent/cold tiers.

For example, Backblaze B2 is 4 times cheaper than S3, and B2 offers history/locking. Since B2 egress is free up to 3x monthly storage, B2 is a much better option than R2 for most cases where considerably high egress is not needed.


Although it's probably faded from everyone's mind, I think Cloudflare and Backblaze still have the Bandwidth Alliance going which means free egress if you combine them.


I'm curious how the performance compares these days.

The last time I benchmarked B2 was years ago but it wasn't as reliable as I wanted at getting me files in under two seconds.


OP is missing that a correct implementation of Databricks or Snowflake will have those instances running inside the same AWS region as the data. That's not to say R2 isn't an amazing product, but the egregious costs aren't as high since egress is $0 on both sides.


Author here. It is true that costs within a region are free, and if you design your system appropriately you can take advantage of that, but I've seen accidental cases where someone tries to access data from another region, and it's nice to not even have to worry about it. Even that can be handled with better tooling/processes, but the bigger point is wanting your data to be available across clouds to take advantage of the different capabilities. I used AI as an example, but imagine you have all your data in S3 and want to use Azure due to the OpenAI partnership. It's that use case that's enabled by R2.


Yeah, for greenfield work building up on R2 is generally a far better deal than S3, but if you have a massive amount of data already on S3, especially if it's small files, you're going to pay a massive penalty to move the data. Sippy is nice but it just spreads the pain over time.


> Sippy is nice but it just spreads the pain over time.

That egress money was going to be spent with or without sippy. It's not "just spreading" the pain, it's avoiding adding any pain at all.


I could be mistaken, but I believe AWS would still charge for one direction of an S3 to Databricks/Snowflake instance/cluster.


AWS S3 egress charges are $0.00 when the destination is AWS within the same region. When you set up your Databricks or Snowflake accounts, you need to correctly specify the same region as your S3 bucket(s), otherwise you'll pay egress.


Cloudflare has been building a micro-AWS/Vercel competitor and I love it; i.e., serverless functions, queues, sqlite, kv store, object store (R2), etc.


FWIW, Vercel is at least partially backed by cloudflare services under the hood.


Right - Vercel's edge functions are just cloudflare workers with a massive markup.


Cloudflare is just reimplementing every service that Akamai has had for 10 years. The only difference is Cloudflare is going after the <$100/mo customer.


Vercel doesn't offer any of that, without major caveats (e.g. must use Next.js to get a serverless endpoint). And to the degree they do offer any of it, it's mostly built on infrastructure of other companies, including Cloudflare.


I would love to see a good blog post or article on Cloudflare's KV store. I just checked it out, and it reports eventual consistency, so it sounds like it might be based upon CRDTs, but I'm just guessing.


We (Databend Labs) benchmarked TPC-H SF100 on S3, Wasabi, Backblaze B2, and Cloudflare R2: S3 leads with its direct connect feature. Wasabi offers good performance. B2 and R2 may not be suitable for big data needs.

Details: https://twitter.com/DatabendLabs/status/1719580350677237987


The other hidden cost when you are working with data hosted on S3 is the LIST requests. Some of the data tools seem very chatty with S3, and you end up with thousands of them when you have small files buried in folders, with a not insignificant cost. I need to dig into it more, but they are always up there towards the top of my AWS bills.


I wish the R2 access control was similar to S3 - able to issue keys with specific access to particular prefixes, and the ability to delegate key creation.

It currently feels a little limited and… bolted on to the Cloudflare UI.


I think the idea is to use Cloudflare Workers to add more sophisticated functionality.


But then I’m forced to write server code where previously I needed none.


But then you start paying for Worker's bandwidth, correct?



I did some like-for-like comparisons across S3 vendors. S3 perf is way better than the challengers; R2 is the worst performer. Also it doesn't support concurrency on list operations, or object versions. So it's a bit more complex than "R2 is best because it's cheapest" - it's not super optimized yet.

https://twitter.com/tomlarkworthy/status/1711846776905293967...


We moved our entire infrastructure to AWS last year, to speed up/simplify/rethink it. We lasted 3 months on S3/CloudFront. We are still heavily invested in AWS, but we moved our production storage/distribution to R2/Cloudflare and couldn't be happier.

Next up: moving our cloud edge (NAT gateways, WAF, etc) to Fortinet appliances, whose licenses we purchased bundled with our on-prem infra.

I know Corey Quinn always harps on AWS' egress pricing but you really can't emphasize it enough: it's literally extortionary!


S3 and R2 aside, OVH's object storage offering is really robust and great. It performs better than S3 and is way cheaper, in both storage and egress cost.


You might even say their offering is… on fire


Agree. We've used it for two years with solid performance and reliability.


>> you’re paying anywhere from $0.05/GB to $0.09/GB for data transfer in us-east-1. At big data scale this adds up.

At small data scale this adds up.

And... it's 11 cents a GB from Australia and 15 cents a GB from Brazil.

If you have S3 facing the Internet, a hacker can bankrupt your company in minutes with a simple load-testing application. Not even a hacker - a bug in a web page could do the same thing.


200 TB in minutes is impressive.

(Assuming your company can be bankrupted for ~$20k.)


At small data scale you fit within the 100GB/month free tier for S3 or 1TB/month free tier for Cloudfront (in front of an S3 backend).


If you are storing a large amount of data: E2 is the cheapest ($20/TB/year, 3x egress for free).

If you have lots of egress: R2 is the cheapest ($15/TB/month, free egress).

R2 can get somewhat expensive if you have lots of mutations, which is not a typical use case for most.
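To put numbers on the "lots of mutations" point, a rough sketch using R2's published per-operation rates around the time of this thread (free-tier allowances ignored, so treat this as an upper-bound estimate, not a quote):

    # Rough R2 monthly bill: storage plus class A (writes/lists) and class B (reads) ops.
    R2_STORAGE_PER_GB_MONTH = 0.015
    R2_CLASS_A_PER_MILLION = 4.50    # mutations: writes, lists, etc.
    R2_CLASS_B_PER_MILLION = 0.36    # reads

    def r2_monthly_cost(gb_stored, writes, reads):
        return (gb_stored * R2_STORAGE_PER_GB_MONTH
                + writes / 1e6 * R2_CLASS_A_PER_MILLION
                + reads / 1e6 * R2_CLASS_B_PER_MILLION)

    # Read-heavy: 1 TB stored, 1M writes, 100M reads -> storage and reads dominate.
    print(r2_monthly_cost(1024, 1e6, 100e6))    # ~$15 + $4.50 + $36  = ~$56
    # Mutation-heavy: same storage, 100M writes  -> class A ops dominate.
    print(r2_monthly_cost(1024, 100e6, 1e6))    # ~$15 + $450 + $0.36 = ~$466
    # Egress is $0 in both cases.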


What's E2? Top google result for "e2 blob storage" is azure, but that can't be it since the pricing table comes at around $18/TB/month.



Yes this is the one.


I think Backblaze B2 is probably the reference (which has free egress up to 3x data stored - https://www.backblaze.com/blog/2023-product-announcement/). I don't know of any public S3-compatible provider that is as cheap as 20$/TB/year (roughly ~$0.0016/GB/mo).



I stand corrected! Will bookmark this to take a look at this later. Thank you!


I imagine it was a typo for backblaze B2? The call out that egress is free for the first 3x of what you have stored matches up.


That’s what I thought they meant as well, but B2 is more like $72/TB/yr. Maybe relevant to another story on the front page right now, they have a very unusual custom keyboard layout that makes it easy to typo e for b and 2 for 7 ;)?



B2 is also nice at $6/TB/mo and 3x free egress.

You can proxy things through Cloudflare and get unlimited free egress thanks to the Bandwidth Alliance between CF and B2.


> In fact, there’s an opportunity to build entire companies that take advantage of this price differential and I expect we’ll see more and more of that happening.

Interesting. What sort of companies can take advantage of this?


Author here, but some ideas I was thinking about:

- An open source data pipeline built on top of R2. A way of keeping data on R2/S3 but then having execution handled in Workers/Lambda. Inspired by what https://www.boilingdata.com/ and https://www.bauplanlabs.com/ are doing.

- Related to the above, but taking data that's stored in the various big data formats (Parquet, Iceberg, Hudi, etc), generating many more combinations of the datasets, and choosing optimal ones based on the workload. You can do this with existing providers but I think the cost element just makes this easier to stomach.

- Abstracting some of the AI/ML products out there and choosing the best one for the job by keeping the data on R2 and then shipping it to the relevant providers (since data ingress to them is free) for specific tasks.


Basically any company offering special services that work with very large data sets. That could be a consumer backup system like Carbonite or a bulk photo processing service. In either case, legal agreements with customers are key, because you ultimately don't control the storage system on which your business and their data depend.

I work for a non-profit doing digital preservation for a number of universities in the US. We store huge amounts of data in S3, Glacier and Wasabi, and provide services and workflows to help depositors comply with legal requirements, access controls, provable data integrity, archival best practices, etc.

There are some for-profits in this space as well. It's not a huge or highly profitable space, but I do think there are other business opportunities out there where organizations want to store geographically distributed copies of their data (for safety) and run that data through processing pipelines.

The trick, of course, is to identify which organizations have a similar set of needs and then build that. In our case, we've spent a lot of time working around data access costs, and there are some cases where we just can't avoid them. They can really be considerable when you're working with large data sets, and if you can solve the problem of data transfer costs from the get-go, you'll be way ahead of many existing services built on S3 and Glacier.


I'm building a "media hosting site". Based on somewhat reasonable forecasts of egress demand vs total volume stored, using R2 means I'll be able to charge a low take rate that should (in theory) give me a good counterposition to competitors in the space.

Basically, using R2 allows you to undercut competitors' pricing. It also means I don't need to build out a separate CDN to host my files, because Cloudflare will do that for me, too.

Competitors built out and maintain their own equivalent CDNs and storage solutions that are ~10x more expensive to maintain and operate than going through Cloudflare. Basically, Cloudflare is doing to CDNs and storage what AWS and friends did to compute.


Your competitors can do the same thing though?


That'd be welcome, I'm not really doing it to make money.

But reality is a bit more complicated than that. Migrating data + pointers to that data, en masse, isn't super easy (although things like Sippy make it easier).

In addition, there's all the capex that's gone into building systems around the assumptions of their blend of data centers, homegrown CDNs, and mix of storage systems. There's a sunk cost fallacy at play, as well as the inertia of knowing how to maintain the old system and not having any experience with the new system.

It's not impossible, but it'd require a lot of willpower and energy that these companies (who are 10+ years into their life cycles) don't really possess.

Having seen the inside of orgs like that before, starting from scratch is ~10x-100x easier, depending on the blend of bureaucracy on the menu.


I'm investigating the same thing. But my bet is that they will either change the terms or lower your cdn-cache size (therefore lowering performance, you can't serve popular videos without a CDN).

And the difference is that you will fail your customers when that time comes because you'll just get suspended (we've seen some cases here on the forum) and you'll have to come here to complain so the ceo/cto resumes things for you.


I don’t believe anybody on a paid plan has been suspended for using R2 behind the CDN? (I’ve seen the stories you’re alluding to. IIRC the cached files weren’t on R2)

In their docs they explicitly state it as an attractive feature to leverage, so that’d surprise me.

That being said, I’m not planning to serve particularly large files with any meaningful frequency, so in my particular case I’m not concerned about that possibility. (I’m distributing low bitrate audio, and small images, mostly).

If I were trying to build YouTube or whatever I’d be more concerned.

That being said, with their storage pricing and network set up as they are, I think they make plenty of money off of a hypothetical YouTube clone.

I do think they’ll raise prices eventually. But it’s a highly competitive space, so it feels like there’s a stable ceiling.


See https://news.ycombinator.com/item?id=34639212. They got suspended for using workers behind the CDN.

> I’m distributing low bitrate audio, and small images, mostly

This means the cache-size would be much smaller though.


Right, but they were serving content that wasn't from R2 as far as I understand from that thread. Not trying to say they that justifies their treatment, only that it doesn't apply to my use case. They were also seeing ~30TB of daily egress on a non-enterprise plan, which would absolutely never happen in my case – 1TB of daily egress would be a p99.9 event.

Re cache-size, maybe I've misunderstood what you mean by cache size limiting, but yeah that's my point – I don't need a massive cache size for my application. My data doesn't lend itself much to large and distributed spikes. Egress is spiky, but centralized to a few files at a time. e.g. if there were to be a single day where 1TB were downloaded at once, 80% of it would be concentrated into ~20 400MB-sized files.


He was ok by the terms though. Workers had/have the same terms as R2 before R2 got the new terms.

> They were also seeing ~30TB of daily egress on a non-enterprise plan, which would absolutely never happen in my case – 1TB of daily egress would be a p99.9 event.

I don't understand what media company you'll be competing against if you'll use just 30TB/month of bandwidth.


I just love MinIO. It is a drop-in replacement for S3. I have never done a price comparison of TCO vs S3 or R2, but I have a good backup story and run it all inside docker/dokku so it is easy to recover.


We went from S3 to MinIO because of cost issues (at that time B2 didn't have an S3 API).

MinIO to SeaweedFS around 2020, because our MinIO servers had problems serving a very large number of small files.

Then this year we migrated to B2 because it's way cheaper and we don't have to rewrite our apps.

Still, my hat goes off to S3. It is so massive that every open source project or competitor needs to have a compatible API, and it gives us the ability to move to any vendor or self-host just by changing the endpoint.


The simple reason Cloudflare hasn't emerged as a real competitor is that they don't offer traditional compute, therefore you can't just do what you normally would do in the hyperscalers in the Cloudflare regions. If they really are trying to be a fourth hyperscaler and/or compete on price, it feels like general compute is what they need. What am I missing?


R2 and Sippy solve a specific pipeline issue: Storage -> CDN -> Eyeball

The real issue is how that data gets into S3 in the first place and what else you need to do with it.

S3 and DynamoDB are the real moats for AWS.


It blows my mind that anyone would consider S3 cheap.

Before the cloud, you could always get plenty of space on dedicated servers for way cheaper.

You could make an argument about the API being nicer than dealing with a linux server - but is AWS nice? I think it's pretty awful and requires tons of (different, specific, non transferable) knowledge.

Hype, scalability buzzwords thrown around by startups with 1000 users and a $1M contract with AWS.

Sure R2 is cheaper but it's still not a low cost option. You are paying for a nice shiny service.


I think it all depends on the volume of data you're storing, access requirements, and how much value you plan to generate per GB.

It's certainly quite cheap for a set of typical "requirements" for media hosting companies.

But yeah, if you're storing data for mainly archival purposes, you shouldn't be paying for R2 or S3.


Interesting side note that while S3 the service continues to get more competition, S3 the protocol has definitively won. It's a good protocol, but man I wish it were more consumer-friendly. Imagine if S3 specified an OAuth2 profile for granting access. Every web app could delegate storage to a bunch of competing storage providers.

This would be very useful in genomics, where pretty much everything is stored on S3 but always a pain to connect to apps.


My fear is that as R2 becomes more 'discovered' and adopted, Cloudflare will hike prices since they'll have a captive audience.


It seems so inevitable. Once they have sucked enough data up then they can just change the pricing structure to have egress fees and higher storage fees.

Although I do wonder if that would be considered a bait and switch.


But what are you going to do with your data in R2? They don't have all the other cloud services to use the data. Unless your only use of the cloud is literally for raw storage, it's not that practical.

Compare with say Oracle cloud which tries to compete by having 1/10th the egress charge. But nobody uses it anyway and they DO offer all the other services.


You could use a computer to access your R2 files - a computer from anywhere on the Internet.


Yeah, if you want to completely ignore Cloudflare Workers, Durable Objects, etc, then that comparison makes sense I guess... but that is only if you also want to ignore that you can serve the files publicly and directly, so that alone has many use cases as well, especially with the free bandwidth.


It looks like Backblaze B2 combined with Cloudflare gives the cheapest storage and free egress. Is there any reason to use R2 over B2 + Cloudflare?

My use case is image storage + serving for a service that users will upload a lot of images to. Currently using Cloudflare + storing all files on disk but space will soon become a concern.


The problem here is that as long as cloud services are sticky, moving your data doesn't really solve the vendor lock-in. Egress is just one way to leverage that stickiness; I can easily come up with another ten ways to charge you as long as you can't easily migrate your stack off a cloud platform.


If I understand correctly, when storing data in vanilla S3 (not their edge offering) the data lives in a single zone/datacenter, right? While on R2 it could potentially be replicated in tens of locations. If that is true, how can Cloudflare afford the storage cost with basically the same pricing?


S3 Standard guarantees that your data is replicated to three availability zones within the region at minimum. (That's different data centers in the same city.)

My assumption is that "at least three" means "exactly three" in practice.


It is clever marketing. Your object in R2 is stored globally as in it might reside anywhere, but they do not actually replicate the object globally.


For the Cloudflare fans out there (I am one of them), it seems that the sales/finance guys have entered the company and started to apply the usual upsell tricks. (See the advanced firewall and bots stuff.)

Perhaps i’m too hasty with my judgement, hope so….


R2 is a nice and cheap service. I just want to caution people that it does have a reduced feature set compared to something more mature like S3 or GCS. For most people who just want to serve an image etc, it's fantastic though.


What would be great is a tiered storage service or library where oft-accessed data is in R2 and infrequently accessed has metadata in R2 but blobs in the cheaper S3 storage tiers or Glacier.


We absolutely love R2, especially when paired with Workers.


Did Cloudflare share any information on how R2 is built? Like what kind of open source systems they use as the foundation or they built it from scratch?


Should we simply ignore the tremendous amount of phishing hosted using r2.dev? Or is this also part of "an economic opportunity"?

Cloudflare may well be on their way to becoming a monopoly, but they certainly show they don't care about abuse. Even if it weren't a simple matter of principle, in case they aren't successful in forcing themselves down everyone's throats, I wouldn't want to host anything on any service that hosts phishers and scammers without even a modicum of concern.


I’ve seen lots of phishing websites hosted on S3, with responses from AWS coming weeks late.


I read this and cannot believe that I can optimize our $5-6K GCP egress bill to zero. Just wow.


Backblaze B2 ftw!


Since I know there will be Cloudflare people reading this (hi!), I'm begging you: please wrest control of the blob storage API standard from AWS.

AWS has zero interest in S3's API being a universal standard for blob storage, and you can tell from its design. What happens in practice is that everybody (including R2) implements some subset of the S3 API, so everyone ends up with a jagged API surface where developers can use a standard API library but then have to refer to the docs of each S3-compatible vendor to figure out whether the subset of the S3 API they need will be compatible with different vendors.

This makes it harder than it needs to be to make vendor-agnostic open source projects that are backed by blob storage, which would otherwise be an excellent lowest-common-denominator storage option.

Blob storage is the most underused cloud tech IMHO largely because of the lack of a standard blob storage API. Cloudflare is in the rare position where you have a fantastic S3 alternative that people love, and you would be doing the industry a huge service by standardizing the API.


I think the subtle API differences reflect bigger and deeper implementation differences...

For example, "Can one append to an existing blob/resume an upload?" leads to lots of questions about data immutability, cacheability of blobs, etc.

"What happens if two things are uploaded with the same name at the same time" leads into data models, mastership/eventual consistency, etc.

Basically, these 'little' differences are in fact huge differences on the inside, and fixing them probably involves a total redesign.


This is a good point, but just a standard for the standard create/read/update (replace)/delete operations combined with some baseline guarantees (like approximately-last-write-wins eventual consistency) would probably cover a whole lot of applications that currently use S3 (which doesn't support appends anyway).

Heck, HTTP already provides verbs that would cover this, it would just require a vendor to carve out a subset of HTTP that a standard-compliant server would support, plus standardize an auth/signing mechanism.
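In practice that lowest common denominator already exists informally: basic object CRUD plus listing, with SigV4 auth, works against essentially every S3-compatible vendor just by changing the endpoint. A sketch of that portable subset with boto3 (endpoints and credentials are placeholders):

    import boto3

    # The portable subset: put/get/list/delete, which essentially every
    # S3-compatible vendor implements. Only the endpoint and credentials change.
    def make_client(endpoint_url, key_id, secret):
        return boto3.client("s3", endpoint_url=endpoint_url,
                            aws_access_key_id=key_id, aws_secret_access_key=secret)

    # client = make_client("https://<ACCOUNT_ID>.r2.cloudflarestorage.com", ...)  # R2
    # client = make_client("https://s3.<REGION>.backblazeb2.com", ...)            # B2
    # client = make_client(None, ...)                                             # AWS itself

    def roundtrip(client, bucket):
        client.put_object(Bucket=bucket, Key="hello.txt", Body=b"hi")
        body = client.get_object(Bucket=bucket, Key="hello.txt")["Body"].read()
        keys = [o["Key"] for o in client.list_objects_v2(Bucket=bucket).get("Contents", [])]
        client.delete_object(Bucket=bucket, Key="hello.txt")
        return body, keys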



