How Canva saves Amazon S3 costs

andrewstuart · on June 19, 2023

Seems crazy to run such large infrastructure on a major cloud.

Tens of millions wasted.

Canva is likely trapped in S3 never to exit. The cost of getting their data out makes it impossible.

S3…. the Hotel California of the cloud. You can check in any time you like but you can never leave.

S3 is 9 cents per GB egress fees.

Cloudflare R2 charges zero egress fees.

At list price of 9 cents per GB it would cost Canva about USD$20 million to export their 230 petabytes. Let me know if my calculation is wrong.

electroly · on June 19, 2023

Tip for anyone looking to spend seven-figure or more sums on one-time egress: Direct Connect egress is $0.02/GB. Rent a rack at a Direct Connect facility and get as many 10G fiber Direct Connects as you need, with corresponding flat rate 10G Internet ports with HE/Cogent/whatever transit provider. If you're going to be spending millions on egress, you could just hire someone to set this up for you. With that kind of spend you'd be crazy to pay the full $0.09/GB.

Edit: Note also that Snowball egress is $0.03/GB. Slightly higher egress, much lower setup cost. You'll have to do the math but they're both clearly attractive options vs. full price $0.09/GB egress.

_t4za · on June 19, 2023

Cloudflare R2 also has a tool in beta that provides an incremental migration from S3. As users request files using your R2 URLs, Cloudflare automatically migrates the files from S3 to R2 on request. Depending on your user's request patterns this may be a way to migrate everything without paying any additional cost. If you have a lot of files that are rarely accessed then they will be slow to migrate, but because they are rarely accessed your S3 costs could be significantly reduced by using different storage classes anyway. Eventually you will have a small enough number of infrequently accessed files remaining in S3 that the additional cost to migrate them all in one go can make financial sense.

vkaku · on June 19, 2023

Awesome. I just wrote about it above. Thanks for sharing!

dehrmann · on June 19, 2023

I worked at a place that did an AWS to GCP migration. I didn't look into the details, but they used a provider that basically gave them a fat, dedicated fiber link between the two (they picked the same GCP region). It was good enough that requests could have data cross clouds. It was expensive to set up, but it saved on egress costs and opened up options for migrating services.

electroly · on June 19, 2023

I believe the relevant GCP product is "Dedicated Interconnect" and for Azure it's "ExpressRoute Direct" but don't quote me on that; I only know AWS. In my example you can swap out the Internet port for one of those cloud interconnects and go straight from AWS to your new cloud. That's certainly what that provider was doing. I bet there's lots of good money in facilitating those migrations using this relatively simple technique.

fakedang · on June 19, 2023

Plot twist : Canva blogged this so someone could tell them some solution like this, without them having to hire the guy. :)

tough · on June 19, 2023

There's this law on the internet or SO or whatever where if you put the wrong answer in a post.

Someone else in the comments will correct you with the correct answer.

This in effect.

grogenaut · on June 19, 2023

This isn't just the internet. I've used this at work several times to break stalemates. I've come into many stalled projects where people are arguing over trivial details or blue skying on requirements. 1-2 hours and a 7 page design doc later I'll have everyone review what I know is an 80% answer. Pretty much every time they'll all start attacking my design, pointing out minor issues and then I have them. They've agreed to my overall design and are into details.

Same if someone is calling out all of the general problems with solving a problem but not providing answers where you can't get them to bite.

lbotos · on June 19, 2023

https://en.wikipedia.org/wiki/Ward_Cunningham#%22Cunningham'...

leon1717 · on June 19, 2023

Cunningham is cunning!!!

onion2k · on June 19, 2023

There's always a cheaper way to do something, but it's important to remember that 'lower price' often doesn't mean 'lower cost'. In order to get the lower price you need to spend on whatever the alternative option isn't doing in order to give that price saving. For example, moving from AWS to onprem means you need to configure the infrastructure yourself; you save on AWS fees but you spend more on devops. And then you have to factor in things like the cost of downtime (on both sides of the equation, especially if you use us-east-1), the price of building your own services, the price of software you need to buy, and so on.

You only save money if the differential cost based on all the factors is lower. AWS is expensive for some things so it often does save some money, but not always, and if you haven't done a proper analysis you can't know.

andrewstuart · on June 19, 2023

>> moving from AWS to onprem means you need to configure the infrastructure yourself; you save on AWS fees but you spend more on devops

This is the traditional “sell” for cloud computing and I don’t buy it at all.

The clouds would have you believe it doesn't make sense to run your own systems because you need too much specialist expertise to run your own systems.

The clouds sales pitch is the if you go cloud then you don’t need all these specialists.

That’s rubbish. Cloud operations need the same or more headcount if technical specialists, they’re just doing different things.

The old “don’t run your own systems, it’s cheaper and easier to go cloud” is just sales fiction.

Don’t believe it.

onion2k · on June 19, 2023

Don’t believe it.

You don't need to 'believe'. You need to do an analysis of what you need and how much each option will cost. Then you can know.

If your argument is "I believe onprem saves money" or "I bet AWS is cheaper" or "Jim Morrison came to me in a dream and said I should use Azure" then you haven't done enough research.

chii · on June 19, 2023

> you haven't done enough research.

the research isn't free either. The more indepth and time it takes to do such research, the slower you come to a decision and ship.

Cloud allows you to ship fast. It allows you to go without research - just accept the marketing, and pay up.

You pay above-cost (compared to on-prem) when you grow to a certain size. But this is usually a worthy trade off tbh.

willcipriano · on June 19, 2023

It's a good way to bypass legacy IT teams. A onprem server will have a bunch of snake oil endpoint protection products running on the box, bespoke config changes and take a couple months to get up and running, a ECS container is up and running in minutes and has whatever you shipped in the container without all the commentary from the peanut gallery.

m1keil · on June 19, 2023

A poor DX can happen in the cloud as well. Waiting weeks for IAM configuration to be solved or a security group to be opened..

icedchai · on June 19, 2023

I've run into this kind of snake oil in the cloud, as well. Some organizations demand all routing go through an "upstream" VPC operated by the parent org's IT dept, so various third-party security services (WAFs, IDSes, etc.) can scan / inspect the traffic.

jedberg · on June 19, 2023

When I moved reddit from on-prem to the cloud, I cut our costs by 27%. It was the same number of people managing both infrastructures, but once we moved to the cloud, I no longer spent most of my time imaging machines and driving them to a datacenter to rack and stack them. Instead I spent my time coding ways to manage machines via API, so that when I needed to double our infrastructure, I just ran a script.

Cloud makes a ton of sense for high growth companies. If your infrastructure is mostly static, that's when it makes sense to go to a datacenter. Or if you have one very specific use case, like Dropbox.

trog · on June 21, 2023

This is very unusual in my experience, except in cases where the on-prem was either massively over-provisioned or the customer was getting absolutely gouged on pricing at the DC or IP transit or whatever.

Do you recall what the big savings were in? Compute, storage, DB etc? Be really interested to understand why this case is so different to many I've read.

jedberg · on June 21, 2023

Funny enough I have the exact numbers handy! Keep in mind this was 2008 and we only had 150M page views a month back then.

Data Center (per month)

Servers: $6K

Cabinet (x3): $15K

Bandwidth: $2.5K

Support: N/A

Total: $23.5K

EC2 (per month)

Servers: $13K

Storage: $1.5K

Bandwidth: $1.1K

Support: $1.2K

Total: $16.8K

trog · on June 23, 2023

Interesting, thanks - I assume on EC2 you were running your own database servers? (Just based on the fact that you don't mention RDS.) This would keep costs down significantly in most cases although I assume they get transferred into staff costs if you've got people maintaining them to get the same level of support for things like backup, replication, etc that you get for "free" in RDS...?

jedberg · on June 23, 2023

Yes, we ran Postgres on EC2. I admined it myself both in AWS and in the datacenter. RDS didn't exist at the time. :)

At the time we and Heroku were running the two largest Postgres clusters on EC2.

manigandham · on June 19, 2023

There are hundreds of factors for each company and application. Sometimes it works and sometimes it doesn’t, but it’s rarely as simple as retail prices.

Even the same number of "technical specialists" doesn't mean it's the same if one option lets you move faster or remain more reliable.

AntonCTO · on June 19, 2023

This is precisely my experience and what I observe in companies that fall into the cloud trap. In addition, it is important not to overlook the significant transformation and upskilling process required, as well as the time needed to accomplish it.

jimwalsh · on June 19, 2023

Datacenters are expensive to run, for many reasons. That is the appeal of the Cloud right there, before we even start talking about the people aspect.

oxfordmale · on June 19, 2023

I challenge any in-house team to achieve the same reliability as S3.

To be fair most businesses probably don't need it, but it is worth taking into account.

kazen44 · on June 19, 2023

the same reliability at what scale? this is a major difference. S3 has amazing reliability but is a incredible complex system with many moving parts, which incurs cost in terms of reliability. The simpler the system, the more reliable it usually is. Most bussiness do not operate at even remotely the same scale as amazon, and running a small scale cluster of S3 like functionality wouldn't be that hard.

oxfordmale · on June 21, 2023

S3 is reliable as data isn't not stored in a single place. It reduces the risk of water leakage or some vandals ripping your storage rack apart.

However, as someone put it, put your money where your mouth is. Become a contractor offering to reduce storage costs in exchange for 10% of the savings. You should be a millionaire in no time, according to your narrative.

oxfordmale · on June 21, 2023

Amazon S3 standard storage offers the following features: Backed with the Amazon S3 Service Level Agreement. Designed to provide 99.999999999% durability and 99.99% availability of objects over a given year.

mschuster91 · on June 19, 2023

> That’s rubbish. Cloud operations need the same or more headcount if technical specialists, they’re just doing different things.

It's a capex vs opex question. For financial quackery reasons, accountants and stock markets prefer having less people on the direct payroll, which is why everything not part of the "core business" is outsourced - even if it is more expensive in either short or long term.

amluto · on June 19, 2023

> And then you have to factor in things like the cost of downtime (on both sides of the equation, especially if you use us-east-1)

This is an important factor to remember when evaluating could costs. If you want to survive a cloud outage, you need a multi-AZ or multi-region deployment, and that costs developer hours. And you need to deal with the potentially catastrophic cost of inter-AZ or inter-region traffic, which can be catastrophic and/or cost more developer hours to mitigate.

mschuster91 · on June 19, 2023

> If you want to survive a cloud outage, you need a multi-AZ or multi-region deployment, and that costs developer hours.

RDS databases are one click to set up in multi-AZ, and if your stack is Kubernetes based or at least EC2 autoscale capable it isn't much more work to make it multi-AZ as well.

Multi-Region deployments however, these are indeed expensive and nasty to set up.

username_my1 · on June 19, 2023

usually you're right but the cost mentioned here is specific to BLOB data storage ... a very specific problem that can be isolated and run on external services and save the company lots of money.

CogitoCogito · on June 19, 2023

^^^

The cloud vs. on-prem argument often seems to ignore the (enormous) middle-ground. Just because one portion of your architecture would do well to be run outside the cloud, doesn't mean you take out all the other parts you don't want to deal with yourself. Furthmore, "on-prem" might mean in your building, in someone else's building co-located and you rent space and control it, or in someone else's building where they deal with most hardware, etc.

That said, maybe Canva has considered a non-AWS solution and decided against it. Or maybe they've gotten certain better deals from AWS. We can't really know for sure on the outside.

maccard · on June 19, 2023

My team uses AWS for _everything_. There's three reasons for this on our team (and I've researched it for our use cases.

- Consistency. Instead of saying "Oh look in AWS for this app, Azure for this one, and hetzner for this app, except it's test env is in AWS", it all just lives in AWS. It massively simplifies docs, onboarding, and reduces the amount of one-person specialised knowledge.

- Engineering Costs. Similar to above, but in terms of engineering, there's less to know and understand. Instead of needing to know how the AWS load balancer routes/connects to a VM somewhere else, and how that VM gets it's blob-storage-data from azure, we only need to understand AWS concepts.

- Vendor Lock In. Yeah, it's there. If we have a service that uses data from S3, there's egress costs from S3 to <other provider>, but not with EC2. We've consciously accepted this lock in for the time being.

Now, we're a 50 person company so YMMV, but the above tradeoffs plus an "opinionated" setup in AWS (everything on ECS, logging to Cloudwatch, RDS for DB) drastically reduced the "ops" overhead on our side after the initial setup. If I started over, I'd make the same decisions again.

amluto · on June 19, 2023

> - Vendor Lock In. Yeah, it's there. If we have a service that uses data from S3, there's egress costs from S3 to <other provider>, but not with EC2. We've consciously accepted this lock in for the time being.

This is where I think the FCC should take action.

To the extent that this issue is a mutually agreeable arrangement between you and Amazon, it seems obnoxious but does not seem like it rises to the level where regulators should take action. But it affects third parties too: specifically, it prevents non-AWS-hosted vendors from effectively marketing their services to you. In that regard, I think the FTC should try to put a stop to this. AWS should not be permitted to effectively subsidize its and its partners’ services over outside competitors.

(And the US Government should never have accepted cloud deals with excessive egress costs. Part of the bidding process should have been a requirement for networking outside the winning provider to be priced competitively with internal networking)

martinald · on June 19, 2023

But you're missing the point above. S3 is probably the easiest service to replace - there are loads of providers which use the _exact_ same protocol as S3. It's a drop in replacement. It literally uses AWS concepts, there is nothing else to learn apart from putting a different url into your application.

Very few people should really be using S3 at any serious scale is my thoughts. The cost savings are enormous (plus cloudflare for example replicates your data a lot closer to users for no extra cost, significantly improving performance). The cost savings can be absolutely enormous for very little/no additional complexity given how many providers are compatible with S3, and the fairly 'boring' nature of S3 compared to other technologies.

tough · on June 19, 2023

People really have to be afraid of running VPS or whatever if they can't spun up a min.io instance to have their own S3 without dealing with Amazon's Bullshit

vasco · on June 19, 2023

How big of a production deployment have you ran on min.io in terms of data transfer / month and total storage?

Because I've done small scale and can tell you I'd run S3 in the future.

tough · on June 20, 2023

Point taken as I haven't used it in prod properly, I'll cross that bridge when I find it but Id rather keep my money and put some time into making it work. There's also cloudflare R2 now if S3 is too expensive but self-hosted is out of the question

CogitoCogito · on June 19, 2023

> My team uses AWS for _everything_...If I started over, I'd make the same decisions again.

Okay? I never said that sticking to a single cloud provider isn't appropriate for some (or maybe even most) people. It's good that you have a setup that you believe works well for you.

PaywallBuster · on June 19, 2023

or just consider different providers

and easily lower costs by 50% depending on balance storage/bandwidth

speed_spread · on June 19, 2023

Nitpick: It's "you can _check out_ anytime you want but you can never leave"; the dissociation between checking out and actually leaving is what makes the line. Still a good reference for the S3 situation!

yashkadakia · on June 19, 2023

Yup, we just moved large chunks of our SaaS platform from the cloud to dedicated servers with Kubernetes serving as an internal cloud. Ended up reducing our cost to 30% (70% saved!).

We realized that instead of using scaling for peaks, by having additional dedicated servers available full time - it still worked out significantly cheaper. We also moved a lot of our internal processing to run in specific windows of time where the expected load was low to maximize server utilization in those periods.

ssss11 · on June 19, 2023

Sorry what’s an “internal cloud”? Please tell me we haven’t renamed on-prem.

simonjgreen · on June 19, 2023

The term "private cloud" has existed for easily 15 years, pretty much straight after Google coined the term "cloud computing". The current deifnition of "on-prem" actually changed after cloud computing reached mass adoption. Prior to then, on-prem referred to "in your office" vs. "in a datacenter". These days "on-prem" means "not in public cloud".

_flux · on June 19, 2023

Yes, it's basically the same as "on prem", but instead just a bunch of individual (virtual) servers they are managed by a system that provides typical cloud APIs, such as OpenStack.

Marazan · on June 19, 2023

It will mean compute resources that can be transparently shared amongst different things (which is where the K8s comes in) as opposed to boxes dedicated to a single service.

monkeywork · on June 19, 2023

on-prem services but running with the tooling that is typically used for cloud computing rather than bare-metal on-prem of old.

ttymck · on June 19, 2023

Why do they need to get everything out? Some not-insignificant percentage of the content is surely archived/unused/unneeded assets. From customers who will never log in again or projects which are long over. Export the most recent 10% and put anything new on <new storage>

notyourwork · on June 19, 2023

Absolutely agree. There are more nuanced strategies to migrate off a cloud provider. Could even enable a soft export for idle accounts, leave it where it lie and move it when customer re-engages.

pc86 · on June 19, 2023

I would think you could set up a migration process where existing things are loaded off of S3, and as soon as you make any changes, you save it to $NEW_STORAGE and delete it off of S3. Eliminates the need to egress anything, lowers your S3 bill over time, and as a user it'd be pretty hard to log in after a year and be upset that your FB cover image is no longer there, especially on the free tier.

zokier · on June 19, 2023

Many aws services are hit or miss but s3 is just a marvel. Running geo-distributed 100s PB Ceph cluster with tiered storage isn't exactly trivial.

littlestymaar · on June 19, 2023

Millions of dollars per year isn't exactly a trivial amount of money either… Nobody said it's some super easy thing one could do on their free time, but given the scale it makes sense to build the knowledge in-house instead of getting vendor-locked-in at the most expensive service provider on the market.

notyourwork · on June 19, 2023

Id argue this isn’t true until your spend surpasses $100MM. Lower amounts don’t justify the extra headache of on-prem.

pc86 · on June 19, 2023

I know we're just arguing about amounts at this point but that's pushing $2M a week. It seems on on-prem would make financial sense long before that.

shawabawa3 · on June 19, 2023

Depends on your requirements. If you need 11 nines durability S3 is probably still cheaper

yencabulator · on June 20, 2023

There's also the hybrids. Only storing "unlikely to be accessed" data on S3 would likely remove a huge part of the AWS bill, while significantly limiting the size of the on-prem data set.

Of all the possible related problems, object storage is the easiest.

littlestymaar · on June 19, 2023

I really hope that Canva's executive dont believe their product need such garantees…

jedberg · on June 19, 2023

> At list price of 9 cents per GB it would cost Canva about USD$20 million to export their 230 petabytes. Let me know if my calculation is wrong.

If you're doing a one-time export you can use a Snowball, which would cost about $70K for 230PB, assuming you had somewhere to load it off to once you got the data so you could give the device back.

xwowsersx · on June 19, 2023

It's hard to imagine why they would ever need to get everything out. That would be extremely wasteful. If they wanted to move off of S3, they'd rely on the same type of analysis they present in this post + perhaps creating a mechanism whereby they permanently move an asset off of S3 if/when accessed.

ignoramous · on June 19, 2023

> It's hard to imagine why they would ever need to get everything out. That would be extremely wasteful.

See the Docker x Cloudflare case study where improving cache-hit ratio by 2% (by moving to R2) decreased S3 egress fee by 66%: https://www.cloudflare.com/en/case-studies/docker/

ToJans · on June 19, 2023

I'd go for a hybrid approach: new data is stored on private cloud, and old data would be gradually migrated or phased out.

shadowtree · on June 19, 2023

Most businesses treat cloud like a utility, not tech. Don't want to generate their own power, deal with sewage ... and compute.

Which in 99.9% of the cases is the right decision. Sorry for all you infra techies, but that's the way tech matures.

betaby · on June 19, 2023

Infra techies are fine. Likewise 'grid is mature' and yet most of buildings with floor space over NN m2 have team of electricians, being them direct hire or contractors. Same for the 'cloud' - 90%+ of the world deployed 'compute' is not cloud.

kazen44 · on June 19, 2023

Also, infrastructure fundementals do not change, now matter where they are running.

As long as dev's don't seem to understand real world limitations, (for instance, a network cannot be instantanious, and latency is not consistent) and thus leave tons of performance on the table, i think the "infra guys" will be just fine.

Also, don't forget there is an entire world out there of people making sure connections to the cloud are even possible across the internet.

shadowtree · on June 19, 2023

Work for a 30bn market cap company, all our stuff is built on AWS. Offices all around the globe. Not a single electrician on payroll.

YMMV.

mschuster91 · on June 19, 2023

> Not a single electrician on payroll.

Directly, yes maybe. Many companies don't have in-house experts (and get shafted by vendors as a result, as they're lacking the competence required to assess the quality of work).

But indirectly, they all have. In Germany, at least, you're required to have electrical appliances inspected by a licensed professional (an electrician or otherwise qualified person) at least every two years - most companies opt for using dedicated external companies, but you can also train someone for that task. On top of that come all the electricians hired by the landlords - the ones dealing with complaints and remodels, or servicing lifts, escalators, HVAC, datacenters, ...

shadowtree · on June 19, 2023

Utility is just an old term for "as a service".

Electricity as a Service, which includes payments for expert maintenance.

Thread is about AWS and merits of in-sourcing, which is not how the world operates.

Anyways.

betaby · on June 19, 2023

You may call your company arrangement `electricityless` even! Like `serverless` - we don't use servers.

coldtea · on June 19, 2023

If you still make a profit and grow with public Cloud, and you had to risk time-to-marker, reliability, man-years, to make something on your own, then it's not that crazy.

More of a "good problem to have" (and you can solve it when you need to).

dehrmann · on June 19, 2023

The real story isn't how expensive it is to leave, it's how easy AWS makes it to spend. It used to be that you'd see how many servers a project needs and have to beg someone for them, and then only get half of what you wanted. With AWS, it's much easier to launch something new, but there isn't the same amount of scrutiny on spend.

littlestymaar · on June 19, 2023

“Nobody ever got fired for choosing I̵B̶M̶ Amazon”

vkaku · on June 19, 2023

Cloudflare could build a product that not only caches, but transfers S3 to R2 on usage.

- Since non transactional data is likely to be on GitHub or somewhere, it should be easy to re-hydrate that on R2.

- Once enough time passes, there could be a sunset on unused data, or a strategy planned to 1-time exit write-off for S3 with those assets not on R2.

Standard migration procedure.

whycombagator · on June 19, 2023

What happens when cloudflare changes the price?

xarope · on June 19, 2023

let me give some hypothetical; let's say Canva decides to cut cost by cutting the amount of storage users could have (e.g. quota, or lifetime):

- if they had a sunk cost CAPEX of 100 PB of storage, all they could do is sell the storage servers for a fraction of the original cost (and folks like me snapping up pre-owned server equipment for said fraction of cost for our home labs).

- on the other, since S3 is OPEX (not sure if they are locked in to 1 year billing cycles? but still better than 3-5 year CAPEX runs), they could cut the cost of that storage "immediately".

Similarly, if they needed to expand, CAPEX means placing an order for the storage, waiting for delivery, racking, burning in etc, versus just paying almost "instantly" for more storage.

So long as they are still earning money, why wouldn't they stick to this OPEX model?

Not saying this is good or bad, it depends on your business model, ultimately.

re-thc · on June 19, 2023

> So long as they are still earning money, why wouldn't they stick to this OPEX model?

The suggestion was to move to a different / cheaper provider. You can stick to OPEX.

re-thc · on June 19, 2023

> Cloudflare R2 charges zero egress fees.

The scarier part is Canva actually does use Cloudflare CDN, so it's not so much the cost of adding a new provider. It's just that the CDN team that manages the Cloudflare account is likely a different team :)

Too much isolation, silos and politics...

sofixa · on June 19, 2023

The much more likely answer is that R2 is too new compared to when they started (announced in 2021) and switching from S3 to it is probably not trivial at Canva's scale.

choppaface · on June 19, 2023

“we store more than 230 Petabytes of data in Amazon S3, with our single largest S3 bucket coming in at a whopping 45 Petabytes”

How did they get so much data?! They say they only have 75 million stock photos. They must have tens of petabytes that are just dark.

alecco · on June 19, 2023

And now it's VC-fashionable to have S3 as "object store" for "serverless" databases. "offload to S3!"

Sai_ · on June 19, 2023

They could optimize this by only egressing they most recently used petabytes of data for some value of “recently”.

avereveard · on June 19, 2023

what's the durability of R2?

VHRanger · on June 19, 2023

How would you measure that realistically? How durable is Backblaze or Azure?

Cloudflare has been reliable on storage adjacent uses (caching) for a long time. Presumably you're more likely to be the cause of errors than R2 is by a few orders of magnitude.

avereveard · on June 19, 2023

well you don't measure it, you take it from their sla

VHRanger · on June 21, 2023

What I mean is that if you have a claim for eleven 9's of durability for a product that's 3 years old you have to take their word for it.

You can't really check the SLA as being realistic

re-thc · on June 19, 2023

eleven 9's apparently

ramraj07 · on June 19, 2023

Do you know how much it’ll cost to store 230 petabytes per month on premises at the same redundancy and availability level? I have done such calculations for academia at a smaller scale and unless you literally want it to last decades it’s definitely not worth it to do your own infra on this.

littlestymaar · on June 19, 2023

I keep seeing that argument, but you, like the others, don't provide the figures you claim you got from your calculation.

Also, it seams that it's 230PB in total, not per month, which fits in an apartment-sized server room (you want to have several of such places, but it's not that big).

jabart · on June 19, 2023

Wasabi has them. Scaling to 230pb would be more than their 1PB calculation. Taking their 5 yr total number, multiply to 230pb, divide by 5, then divide by 12 has a monthly total of 5,021,666. It's also 131 server racks according to wasabi. At 131 racks of storage you likely now need racks dedicated to networking since a top of rack switch wouldn't be enough. With space for the racks it's about 2,227sqft of space.

https://wasabi.com/blog/on-premises-vs-cloud-storage/

littlestymaar · on June 19, 2023

This is a marketing pitch for a cloud storage company and it shows:

> - System hardware: $500,000

For one PB? Gold plated HDDs and cases I guess…

> Assume 15% annual system maintenance over five years

Do not run your hard drive cluster on the same floor as vibration-inducing machines, folks. /s

According to Backblaze, even a 8 year-old hard drive is still twice more reliable than this, and 15% failure rate is what to expect over 6 years of consecutive use!

> Assume the space needed to store 1PB is between 25%-50% of a standard rack (42U)

You can fit almost 10PB in a rack though, so that's another figure that they've inflated.

> Assume personnel costs to manage a 1PB system are 0.5 IT/storage admin FTE

This one is probably true (and it's even an underestimation) if you have only 1PB of data, but it's not going to scale linearly: you're never going to need 130 FTE for 260PB.

Overall, if a supplier (who's going to use the same kind of off-the-shelf tech as you would) claims that his price is 4 times cheaper that what it would cost you to run it by yourself, he's probably just lying to sell you stuff, just saying…

jabart · on June 19, 2023

Can you fit 10pb in a rack, sure if you want it all flash. You also need double that for mirroring, or quadruple that for raid 10. Double that if you want a mirrored copy of that array too. If you don't have high performance needs you can get by with slower hard drives. You also have the 25gb,40gb, or 100gb top of rack switch. Well you want two of those for redundancy. It starts to quickly add up. Annual system maintenance could be lower but is still a cost. Number 1 reason I'm in a datacenter is to swap a dead drive.

Oct 2020, StorONE announced a 1PB All-Flash array for $499k.

Other examples which are close to their pricing. https://www.reddit.com/r/storage/comments/vdr6ql/pricing_exa...

littlestymaar · on June 19, 2023

> Number 1 reason I'm in a datacenter is to swap a dead drive.

No doubt about that, when you have tens of thousands of drive you end up replacing some of them all the time, but not 15% of them every year…

> Oct 2020, StorONE announced a 1PB All-Flash array for $499k.

With Optane and all, right? The kind of things that's a massive overkill if your workload allows you to use Glacier…

> Other examples which are close to their pricing.

In that list, the only offers that are close to this pricing (but still almost 20% cheaper) also includes several years of support (which you probably don't want if you're Canva's scale).

kazen44 · on June 19, 2023

> Oct 2020, StorONE announced a 1PB All-Flash array for $499k.

mind you the price of flash storage has dropped significantly since 2020.

Hard disks are still being used, but having all flash systems is not uncommon.

hinkley · on June 19, 2023

140 racks is a quite a lot of tonnage of AC as well. And probably hot and cold rows to make it function.

jabart · on June 19, 2023

Hot/cold was factored in not the AC or power needs. I don't know many places you could just casually ask for 100+ racks. Maybe our midwest datacenters are just smaller than what is on the coast.

hinkley · on June 19, 2023

The last place I worked that had that kind of capacity, I had access to the lobby of that building and that was about it. The server room I could access had space for <wiggles fingers> maybe 15-20 racks and in fact had maybe 7, which is probably why it was always cold as fuck in there and I didn’t like being in there much.

Then there was a contract which had a server room twice the size and mostly contained an AS/3<mumble> and a couple racks of Intel hardware. You could probably get a ping pong table into that one. Giant rooms that are mostly white and too cold are a little freaky, and reminiscent of 2001.

There have been a number of stories of people not being able to utilize their floor space for more servers because they ran out of space on the roof for AC units. At one point I recall Google was working on power dissipation because they had hit the max electrical code, and so no one would agree to run more power into their buildings. More servers meant more compute per watt and per ton of chillers.

jabart · on June 19, 2023

AWS has a talk about things like this. They were, no idea if they still do, setup an AZ based on MWH figure for most efficiency because any higher the power company doesn't like it. Our local DC the water cooling and refrigerant cooling is larger than the 5 generators they have. I've crammed a lot of servers in my rack and I'm out of power plugs but still under 2kwh of what we are on contract for. Dense servers have really messed with datacenters.

littlestymaar · on June 19, 2023

> I don't know many places you could just casually ask for 100+ racks.

That's obviously not something you can do on a whim, but at the same time we're talking about a seven figures investment plus the time to build the in-house team to manage this, so it's not going to happen overnight anyway.

motoxpro · on June 19, 2023

A lot of people in this thread should start consultancies that transition companies off the cloud.

Apparently, it would be easy for them to save tens of millions, in turn making millions for the consultant.

vasco · on June 19, 2023

Couldn't agree more.

One only says this if they don't have a good understanding of the differences between capex & opex, the amount of money spent on tech labor & compliance projects, as well as the level of assurances AWS give you regarding durability of data. All of this in 2023 when we have almost 20 years of data showing more and more business moving to the cloud and staying in the cloud.

I think anti-cloud sentiment aligns with hacker mentality on decentralization=good (which it is), anti big corporation feelings (boo amazon) and so it leads people in these threads to make emotional arguments over something that is clearly going in the other direction. It's an excellent example of confirmation bias, trying to look for any and all justification to say the cloud is worse, when the majority of businesses have decided otherwise.

charcircuit · on June 19, 2023

Have you even looked at S3 bandwidth pricing? I don't get how anyone using S3 for storing media or binaries can justify that outrageous price unless every download is connected to something making you money.

Quarrel · on June 19, 2023

Well, step 1, have petabytes of data in S3, then don't pay the sticker price.

Step 2, write articles like this as part of your newly contracted "engagement" with Amazon to help justify the lower price.

charcircuit · on June 19, 2023

>step 1, have petabytes of data in S3, then don't pay the sticker price.

So pay Amazon a bunch for the privilege to have lower prices. How about just starting off somewhere where you don't have to commit to paying a lot of money to get a reasonable price?

Quarrel · on June 19, 2023

Because when you start out you don't know you'll need petabytes of storage.

Canva grew very quickly. I don't think it was a terrible call to use the cloud at the time.

charcircuit · on June 20, 2023

Cloud isn't the issue. Using overpriced services is the issue. Regardless of size they shouldn't have used S3.

jesterson · on June 20, 2023

It's just a part of natural cycle.

Cloud became a fad a while ago, companies like Amazon/Google/MSFT rushed in burning cash to provide incentives. Now said companies adjust fees to milk businesses due to high exit cost, just because they can. This will inevitably create outflow of customers, however we are far from this point currently.

AntonCTO · on June 19, 2023

I do. Interested? :D On a small project, I managed to reduce costs by 70 times simply by transitioning and eliminating unnecessary features. This happened four years ago, and the project is still running smoothly. However, what I truly enjoy is providing an objective view of the actual cloud transition process to companies before they make any misguided decisions.

motoxpro · on June 20, 2023

Just a commentary that if you could save a bunch of companies 70x then that is the easiest sell in the world. I WISH I could do something like that because I would be a billionaire haha.

citrin_ru · on June 19, 2023

Consultants who help to move to the cloud are not cheap too. And migration to AWS/Azure is still more common than out.

jabroni_salad · on June 19, 2023

Pretty much the only go-backs that I have ever seen were when someone tried to forklift an onprem application that was never meant to run on a computer whose cpu cycles are metered. One of the most striking things to me in ops:dev relations was the way the conversation around performance and efficiency changed once we started paying for it by the cycle.

electroly · on June 19, 2023

There is a lot of good money to be made for people who understand cloud cost structures and the alternative options available. You'll want to target mid-size businesses that bet on a cloud and were actually successful but are now drowning in cloud bills. Large companies will hire in-house expertise, and small companies don't spend enough for it to be worth hiring a contractor. But I really think there's a market of mid-size companies who don't know how to reduce their AWS bills. Cloud cost optimization (and perhaps migrations between clouds or to hybrid cloud/on-prem at times) would be a great business for a solo contractor. Being able to pay for your own contracting fee out of the savings is almost a given.

paxys · on June 19, 2023

They could do it all in a weekend, easy.

llanowarelves · on June 19, 2023

The thing about economics is that it's physics, with a time delay. We are seeing the time-delayed reaction to over-extension into the cloud play out, not instantly but in the downturn cycle.

hinkley · on June 19, 2023

So do you think server hardware companies will do well during a downturn?

jaza · on June 19, 2023

I'm really surprised how much anti cloud opinion is being voiced here (but maybe I should have known better, after all these years on HN).

Even if you don't use any other cloud tech in your stack, my advice is, use S3 (or equivalent)! That level of durability, availability, and scalability, is not trivial to pull off, and is best made somebody else's problem. Even when you're as big as Canva.

withinboredom · on June 19, 2023

I'm a big fan of Garage[1], which is a dead-simple S3 drop-in that you can host on your own drives. It's designed for consumer hardware with shitty internet in-between nodes.

[1]:https://garagehq.deuxfleurs.fr/

AntonCTO · on June 19, 2023

Thank you, that's amazing! While there isn't an extensive comparison available, I have come across a small benchmark that compares a few features. You can find it here: https://garagehq.deuxfleurs.fr/blog/2022-perf/

paxys · on June 19, 2023

All the marketing pages in the world mean nothing until they can show who currently uses them and at what scale.

withinboredom · on June 19, 2023

Well, someone has to make that experiment and see if it’s right for them, right? It’s only a few years old, and most bigger companies are already sold on whatever they’re using. Smaller companies you’ve never heard of will be the first drivers, most likely.

FWIW, I have an EU region cluster (3 nodes per inner-region in the North, West and South). It’s all one logical region to S3 libraries and holds nearly half a terabyte.

It’s probably one of the most stable parts of my infrastructure. I don’t know (or need to know) what the replication lag is between regions, but IIRC, there’s some configuration of consistency.

tansan · on June 19, 2023

Whoa! this is the first time I have seen deuxfleurs. Do french people use this over github?

withinboredom · on June 19, 2023

I believe deuxfleurs is a programming co-op. The UI is called gitea IIRC, an open source GitHub clone.

adduc · on June 19, 2023

This particular instance appears to be running Forgejo, a soft-fork of Gitea.

activiation · on June 19, 2023

If you want trouble and exorbitant fees, the cloud is for you...

Dunedan · on June 19, 2023

Something which is often overlooked is that transitioning objects in S3 to another storage class using lifecycle policies is a one-way operation. If you later notice that the chosen storage class doesn't fit or your use case changes, you can only go "forward" using lifecycle policies and not "back" to e.g. "S3 Standard" [1]. Going "back" is therefore a quite costly and cumbersome operation, as you have to apply the change of the storage class to each object individually.

[1]: https://docs.aws.amazon.com/AmazonS3/latest/userguide/lifecy...

jokethrowaway · on June 19, 2023

Canva is a case where AWS made sense for the founders. The founders anyway made billions and who cares if you give a few hundreds millions to Jeff and have to hire expensive AWS consultants if you can avoid hiring a few ops people that actually know how to operate a server /s

More seriously, I think the sweet spot is to buy 2x o 3x what you expect to need but do it off the cloud.

At Canva's scale you'll spend hundreds of thousands, save literal millions and have way more headroom and stability than with crappy auctioned instances on AWS.

oefrha · on June 19, 2023

Apple uses AWS for most(?) of their cloud services. TikTok grew to the #1 spot entirely on a public cloud (Alibaba). I’m sure they sure have hired “a few ops people that actually know how to operate a server” instead.

nemothekid · on June 19, 2023

The anti-cloud sentiment on HN is as old as time; there will always be posters who can claim that they can build X with just dedicated machines on Hetzner for 1/10th of the cost. Yet AWS keeps growing and the people running these companies must be unable to do basic math. My account just crossed 11 years, and I think the whole cloud vs on-prem conversation is probably one of the most least productive conversations on HN. If there was so much waste out there you would think someone would figure out how to just siphon all that waste directly into their pockets instead of Jeff's; and maybe they have, I don't know, but the posts are always the same, handwaving about how the mark ups are insane the engineers at these companies aren't able to balance a checkbook.

kkielhofner · on June 19, 2023

Inertia, group think, marketing, FUD, and "no one ever got fired for buying X" is INCREDIBLY powerful.

At this point even the not-infrequent public cloud outages are essentially shrugged off with "Eh, we use $CLOUD, that's what everyone uses. What are you going to do?"

The ops team can actually say "It's Amazon, not our problem." while they sit around powerless to do anything about it.

Yes there is the "survive anything AWS related" geo-distributed multi-region approach that you pretty rarely even see because it itself requires an army of AWS experts and drastically increased cost to actually implement and operate while not shooting yourself in the foot left and right - actually decreasing reliability. Not to mention the uber-unicorn "multi cloud" which at least I've never seen or actually heard of in practice. These don't usually survive even an initial cost review.

betaby · on June 19, 2023

> there will always be posters who can claim that they can build X with just dedicated machines on Hetzner for 1/10th of the cost

and some of them just did. Both know names and small shops. Big names like cloud-flare R2 or Backblaze b2.

betaby · on June 19, 2023

Apple operates their own data centers. Apple uses GCP as part of the Google's deal related to search.

iLoveOncall · on June 19, 2023

At Canva's scale you'll spend hundreds of thousands a year, just hiring one of the engineers working on your home-grown solution. Multiple this by 50 or more just for the staff, before talking about hardware, cost of delivery speed, cost of bugs, etc.

andrewstuart · on June 19, 2023

Instead you can spend millions on paying top market dollar to AWS experts trying to work out your giant pile of spaghetti cloud infrastructure.

Desperately trying to work out if they can delete this resource or that because who the hell knows which bit of critical code is using it. Or spending days trying to manage a chewing gum ball of IAM policies.

weird-eye-issue · on June 19, 2023

Wouldn't you just be swapping those problems with similar problems if you switched to on prem?

> Desperately trying to work out if they can delete this resource or that because who the hell knows which bit of critical code is using it.

How is this unique to the cloud?

iLoveOncall · on June 19, 2023

> Instead you can spend millions on paying top market dollar to AWS experts trying to work out your giant pile of spaghetti cloud infrastructure.

The point is that you will spend millions anyway. Might as well have something better than whatever home-grown dumpster fire your team will come up with.

You also seem very ignorant of how AWS works. I've never had to wonder if I could delete a resource or not, because all our resources are created via CloudFormation. Same for IAM policies.

rirze · on June 19, 2023

Sounds like your experience is limited to badly managed CloudOps teams.

xnx · on June 19, 2023

Articles like this are a great example of how you can be a hero for fixing a problem, but get no recognition for preventing one.

soderfoo · on June 19, 2023

Surely they negotiated a private pricing agreement given the volume of data/usage...one would hope.

Having been a part of these negotiations for services other than S3, DTO discounts can be generous if you have a good forecast (3-5 years) on your data egress.

orangepanda · on June 19, 2023

Offtopic — is it some cardinal sin to include a link to the home page from a documentation? I dont know what canva does, if I stumble on a documentation page first, should I not be allowed to see the product's landing page at some point? This is so prevalent, it has to be a conscious choice.

skilled · on June 19, 2023

What? There are multiple homepage links on this page, including the Logo, Home from navigation menu, as well as the first word in the article is a link out to Canva's homepage.

orangepanda · on June 19, 2023

The logo and "home" do not lead to the product page - they lead to developer docs home page.

Did not think to try the first link though.

SerCe · on June 26, 2023

Thank you for raising this! I'll take a look at how we can make the discovery more straightforward.

Solvency · on June 19, 2023

My favorite is when you click on some landing page made by a company.... and there is zero way to get to the rest of the site. Marketing-centric landing pages often completely strip out their global navigation for no reason.

rat9988 · on June 19, 2023

Real question, how to scale a business to petabytes of data without cloud? How is it done in practice?

withinboredom · on June 19, 2023

Buy hard drives. Typically the spinning ones for the data because $/GB is really cheap. Then you buy some SSDs for your actual applications to run on.

From there you install Hadoop, maybe a query tool like Spark, Impala, and Presto. Then you install some reporting tools like Apache SuperSet.

At least that's how we did it at my old job with Petabytes of data.

zokier · on June 19, 2023

Quickly checking Dell sells you boxes that are advertised as 14PB per rack and have S3 api. You probably can do better by shopping around, but it gives a ballpark what is available in the on-prem market.

dx034 · on June 19, 2023

If you don't mind it running in a data center in Europe, you could rent servers from Hetzner for ca. $400 per month for 220TB. I'm sure they can provide a few PB on the spot. Spread them across their locations and you have geo redundancy too. Next to no egress fee also allows to have a copy with another provider.

https://www.hetzner.com/dedicated-rootserver/matrix-sx

pjc50 · on June 19, 2023

Have a look at the Backblaze blogs, where they detail how they built a storage provider on the cheap.

bluedino · on June 19, 2023

and be prepared to write your own software stack

acchow · on June 19, 2023

10 Petabytes is 10,000 TB. That's only 454 spinning disk drives (22 TB each). Approximate this as 600 with Raid 5. You can easily fit this in 100 desktops.

There's tons of OSS to manage all layers of this stuff now.

simonjgreen · on June 19, 2023

Slight tangent from the point, however: friends don't let friends use RAID5. Putting this out as a PSA reminder as it's not often discussed any more given the prevalance of cloud storage.

RAID5 gives you really the worst of all worlds. You get poor performance (n/4) due to the number of operations required and you get the fragility of a RAID storm that comes from a cascade RAID failure when a drive dies. A "stripe of mirrors" (RAID10) gives you the safest and the highest performance array (at scales where these decisions are being made which most people would see anyway). The common argument against is the cost (usableCapacity = totalCapacity/2) but my counter argument is that RAID5 is a timebomb with a secret countdown. RAID5 is great if you don't need the data on the array.

Source: >20 years running complex large arrays

buildbot · on June 19, 2023

Or “software” raid 10 with ZFS mirror dev’s all slapped into one pool?

simonjgreen · on June 19, 2023

Precisely that

wongarsu · on June 19, 2023

What's your opinion on RAID6? Reasonable middle ground between cost and redundancy or added complexity with the same performance problems as RAID5?

zokier · on June 19, 2023

With Ceph, and I imagine other similar solutions, you can skip RAID altogether and manage redundancy at higher level over plain disks. RAID makes sense if you specifically want to have conventional filesystem on top of it, but for object storage its not necessary.

simonjgreen · on June 19, 2023

You're applying the same principles, just at the object level rather than the filesystem level. You still have to choose a redundancy pattern.

But I do agree with you, it does make the problem significantly easier/different.

ZFS and Ceph are marvelous creations.

simonjgreen · on June 19, 2023

RAID6 is passable for a low performance small scale bulk storage, but the performance hit is worse than RAID5 and the rebuild times are even longer.

AntonCTO · on June 19, 2023

RAID10 for 22TB disks with only two drives at "1" level? I would expect no less than 3 drives at such capacity. Otherwise there is a huge chance to lose the second drive while 1. changing the disc (=time, could be even days) and 2. replication.

What are your thoughts on implementing RAID 0 over four RAID 1 arrays, each consisting of three disks, resulting in a total of 12 drives?

simonjgreen · on June 19, 2023

Agreed, 22TB is a _long_ rebuild time. I'd either go 3 drives in the mirror in that scenario, or go for smaller spindles, depending on requirements. It's not always about the biggest spindles possible.

aljarry · on June 19, 2023

But that's on a single RAID in a single location. You'd probably need at least one failover, and backups for that as well, if you'd like to target higher availability / durability.

S3 keeps multiple copies in multiple geographic locations.

jlokier · on June 19, 2023

When you are distributing over multiple geographic locations, and much of the data can tolerate a longer round trip time for access, you can extend the RAID pattern across multiple locations, instead of within a location. Provided you have enough network bandwidth for recovery, you don't need a full RAID at each individual site.

Having more locations is actually helpful for this. Because of the independence of different locations, the risk of data loss is in some ways lower than a local RAID system where all drives might be fried at the same time by a single event.

colechristensen · on June 19, 2023

You can have a few petabytes in a single rack.

50-ish drives in a single 4U server is done.

icedchai · on June 19, 2023

What about single points of failure? Replication? High availability? Disaster recovery? Backup / snapshots/ versioning? This is one of the benefits of the cloud.

kazen44 · on June 19, 2023

half of these are already managed by something like ZFS.. (the only one i can think of which is not inherent in the Filesystem is High availability). But one should build that kind of redundancy on a higher level anyways. (multiple storage pools, replicating writes etc).

or, if you go fibre channel, just replicate the writes across multiple storage systems and be done with it. (this requires a fibre channel network though, which is neither easy nor cheap).

fh973 · on June 19, 2023

Use a distributed file system or object store with what you need, like Quobyte or Ceph.

colechristensen · on June 19, 2023

Plenty of tools for that on premises. Are you actually asking?

icedchai · on June 19, 2023

No, I'm just mentioning these things to show that the decision is more complex than a "4U server" vs "S3." S3/AWS gives you a ton of stuff you'd have to implement yourself. Plus you need to hire more people to manage it.

colechristensen · on June 19, 2023

Minio does most of S3 for example, duh you would have to do some stuff yourself but you wouldn’t be implementing everything yourself from scratch, far from it.

https://min.io/

encodedrose · on June 19, 2023

If you're at a point of spending $1.6M to transition objects from one class to another and spending millions per month in storage costs -- you need to have a real conversation about if storing your data in a vendor-locked cloud is the right path forward. S3 isn't the only option, MinIO + dense storage is one viable option if your spend is high enough to justify running MinIO. Backblaze is another.

List price of Backblaze:

$0.005/GB - Base Storage

$0.01/GB - Egress (Use a data partner to bring this down to $0)

$0 - PUT API Requests

$0.004/10,000 - GET API Requests

No delete penalties a/k/a minimum storage

List price of Glacier Instant Retrieval:

$0.004/GB - Base Storage

$0.09/GB - Egress

$0.02/1,000 - PUT API Requests

$0.01/1,000 - GET API Requests

Billed for 90 days of storage

MagicMoonlight · on June 19, 2023

There is no way they need 230 PB saved. That's javascript-tier design in practice. If you can put almost all of it into glacial storage then that's because you don't actually need to store it.

jgalt212 · on June 19, 2023

> Canva now has users across 190 countries, who design in over 100 languages thanks to some amazing localization work done by our Internationalisation team.

How may of those countries are profitable? My sense is Internationlization is a relatively fixed cost, but how does customer support in Burma, for example, not trump any realizable revenues there?

StratusBen · on June 19, 2023

This is an interesting post...though I'm surprised to see this kind of optimization occur so late stage/as large as Canva.

There's quite a few anti-cloud opinions in the comments but for anyone who is hosted on S3 and would like this same level of automated analysis for cost savings on S3 (and other AWS services), we basically profile for this out-of-the-box on Vantage and it's free to get an optimization check to see savings: https://www.vantage.sh/

Not meant to be as shameless of a plug, but seems very relevant to the topic at hand for onlookers.

Solvency · on June 19, 2023

"I'm surprised to see this kind of optimization occur so late"

Every single day I read the trope on HN that "premature optimization is the root of all evil".

So, these guys literally followed HN's creed.

lowbloodsugar · on June 19, 2023

Canva is ten years old. They grew to 1 million users in two years. Doing this work 8 years ago wouldn't have been "premature optimization". More likely, this is happening now simply because the financial environment has changed and companies are now giving increased weight to projects that save money over those that add features.

DarthNebo · on June 20, 2023

Yeah no, R2 is a way better choice than being vendor locked like this when burning millions

jgalt212 · on June 19, 2023

> 100 million monthly active users

Oh, my. That's a lot. What does Figma? 300?