How Canva saves Amazon S3 costs (canva.dev)
134 points by kenaqshal on June 19, 2023 | 177 comments



Seems crazy to run such large infrastructure on a major cloud.

Tens of millions wasted.

Canva is likely trapped in S3 never to exit. The cost of getting their data out makes it impossible.

S3…. the Hotel California of the cloud. You can check in any time you like but you can never leave.

S3 is 9 cents per GB egress fees.

Cloudflare R2 charges zero egress fees.

At list price of 9 cents per GB it would cost Canva about USD$20 million to export their 230 petabytes. Let me know if my calculation is wrong.
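The estimate above can be checked with a few lines of arithmetic. This is a minimal sketch that assumes a flat $0.09/GB rate with no volume tiers or negotiated discounts; the only open question is whether "petabyte" is read in decimal or binary units, so both are shown.

```python
# Figures taken from the comment above: 230 PB stored, $0.09 per GB list-price egress.
PETABYTES = 230
PRICE_PER_GB = 0.09

GB_PER_PB_DECIMAL = 1_000_000   # 1 PB = 10^6 GB (decimal units)
GB_PER_PB_BINARY = 1024 ** 2    # 1 PiB = 2^20 GiB (binary units)


def egress_cost(petabytes: float, price_per_gb: float, gb_per_pb: int) -> float:
    """Flat-rate egress cost in dollars, ignoring any volume discounts."""
    return petabytes * gb_per_pb * price_per_gb


decimal_cost = egress_cost(PETABYTES, PRICE_PER_GB, GB_PER_PB_DECIMAL)
binary_cost = egress_cost(PETABYTES, PRICE_PER_GB, GB_PER_PB_BINARY)

print(f"decimal units: ${decimal_cost:,.0f}")
print(f"binary units:  ${binary_cost:,.0f}")
```

Either reading lands between $20 million and $22 million, so the "about USD$20 million" figure holds up under these assumptions.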


Tip for anyone looking to spend seven-figure or more sums on one-time egress: Direct Connect egress is $0.02/GB. Rent a rack at a Direct Connect facility and get as many 10G fiber Direct Connects as you need, with corresponding flat rate 10G Internet ports with HE/Cogent/whatever transit provider. If you're going to be spending millions on egress, you could just hire someone to set this up for you. With that kind of spend you'd be crazy to pay the full $0.09/GB.

Edit: Note also that Snowball egress is $0.03/GB. Slightly higher egress, much lower setup cost. You'll have to do the math but they're both clearly attractive options vs. full price $0.09/GB egress.
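To put the three per-GB rates side by side, here is a rough comparison for a 230 PB export. The rates are the ones quoted in this thread and are not checked against current pricing; setup costs (rack rental, transit ports, device fees, labor) are deliberately left out, so this is only the per-GB part of the bill.

```python
STORED_GB = 230 * 1_000_000  # 230 PB in decimal GB, per the thread

# Per-GB rates as quoted in the comments above (assumed, not verified).
RATES = {
    "internet egress (list)": 0.09,
    "Snowball": 0.03,
    "Direct Connect": 0.02,
}


def costs_by_route(stored_gb: int, rates: dict[str, float]) -> dict[str, float]:
    """Per-GB transfer cost in dollars for each route, excluding setup costs."""
    return {route: stored_gb * rate for route, rate in rates.items()}


costs = costs_by_route(STORED_GB, RATES)
baseline = costs["internet egress (list)"]
for route, cost in costs.items():
    saving = baseline - cost
    print(f"{route:<24} ${cost / 1e6:5.1f}M  (saves ${saving / 1e6:4.1f}M vs list)")
```

At this volume the difference between the routes is in the tens of millions, which is why the setup cost of either alternative is small by comparison.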


Cloudflare R2 also has a tool in beta that provides an incremental migration from S3. As users request files using your R2 URLs, Cloudflare automatically migrates the files from S3 to R2 on request. Depending on your user's request patterns this may be a way to migrate everything without paying any additional cost. If you have a lot of files that are rarely accessed then they will be slow to migrate, but because they are rarely accessed your S3 costs could be significantly reduced by using different storage classes anyway. Eventually you will have a small enough number of infrequently accessed files remaining in S3 that the additional cost to migrate them all in one go can make financial sense.
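The migrate-on-request idea can be sketched in a few lines. This is not Cloudflare's implementation or API; `MigrateOnRead` is a hypothetical class and plain dictionaries stand in for the two buckets. It only illustrates why each object is read from the old store at most once.

```python
class MigrateOnRead:
    """Serve from the new store; on a miss, copy the object over from the old store."""

    def __init__(self, old_store: dict[str, bytes], new_store: dict[str, bytes]):
        self.old_store = old_store  # stand-in for the S3 bucket
        self.new_store = new_store  # stand-in for the R2 bucket
        self.bytes_pulled = 0       # bytes read out of the old store (billable egress)

    def get(self, key: str) -> bytes:
        if key in self.new_store:
            return self.new_store[key]   # already migrated: no read from the old store
        data = self.old_store[key]       # first request: one read from the old store
        self.bytes_pulled += len(data)
        self.new_store[key] = data       # keep a copy so later reads stay local
        return data


old = {"logo.png": b"x" * 100, "rarely-used.psd": b"y" * 5000}
new: dict[str, bytes] = {}
store = MigrateOnRead(old, new)

for _ in range(3):
    store.get("logo.png")

print(store.bytes_pulled)  # 100: three requests, one read from the old store
print(sorted(new))         # ['logo.png']: the unrequested file has not moved
```

The first read of each object would have incurred egress anyway when serving the user, which is the sense in which the migration adds no extra cost. Objects nobody requests stay behind, matching the long tail described above.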


Awesome. I just wrote about it above. Thanks for sharing!


I worked at a place that did an AWS to GCP migration. I didn't look into the details, but they used a provider that basically gave them a fat, dedicated fiber link between the two (they picked the same GCP region). It was good enough that requests could have data cross clouds. It was expensive to set up, but it saved on egress costs and opened up options for migrating services.


I believe the relevant GCP product is "Dedicated Interconnect" and for Azure it's "ExpressRoute Direct" but don't quote me on that; I only know AWS. In my example you can swap out the Internet port for one of those cloud interconnects and go straight from AWS to your new cloud. That's certainly what that provider was doing. I bet there's lots of good money in facilitating those migrations using this relatively simple technique.


Plot twist : Canva blogged this so someone could tell them some solution like this, without them having to hire the guy. :)


There's this law on the internet (or SO or wherever) that if you post the wrong answer, someone else in the comments will correct you with the right one.

This is it in effect.


This isn't just the internet. I've used this at work several times to break stalemates. I've come into many stalled projects where people are arguing over trivial details or blue skying on requirements. 1-2 hours and a 7 page design doc later I'll have everyone review what I know is an 80% answer. Pretty much every time they'll all start attacking my design, pointing out minor issues and then I have them. They've agreed to my overall design and are into details.

Same if someone is calling out all of the general problems with solving a problem but not providing answers where you can't get them to bite.



Cunningham is cunning!!!


There's always a cheaper way to do something, but it's important to remember that 'lower price' often doesn't mean 'lower cost'. In order to get the lower price you need to spend on whatever the alternative option isn't doing in order to give that price saving. For example, moving from AWS to onprem means you need to configure the infrastructure yourself; you save on AWS fees but you spend more on devops. And then you have to factor in things like the cost of downtime (on both sides of the equation, especially if you use us-east-1), the price of building your own services, the price of software you need to buy, and so on.

You only save money if the differential cost based on all the factors is lower. AWS is expensive for some things so it often does save some money, but not always, and if you haven't done a proper analysis you can't know.


>> moving from AWS to onprem means you need to configure the infrastructure yourself; you save on AWS fees but you spend more on devops

This is the traditional “sell” for cloud computing and I don’t buy it at all.

The clouds would have you believe it doesn't make sense to run your own systems because you need too much specialist expertise to run your own systems.

The clouds' sales pitch is that if you go cloud then you don’t need all these specialists.

That’s rubbish. Cloud operations need the same or more headcount of technical specialists; they’re just doing different things.

The old “don’t run your own systems, it’s cheaper and easier to go cloud” is just sales fiction.

Don’t believe it.


> Don’t believe it.

You don't need to 'believe'. You need to do an analysis of what you need and how much each option will cost. Then you can know.

If your argument is "I believe onprem saves money" or "I bet AWS is cheaper" or "Jim Morrison came to me in a dream and said I should use Azure" then you haven't done enough research.


> you haven't done enough research.

The research isn't free either. The more in-depth the research and the more time it takes, the slower you come to a decision and ship.

Cloud allows you to ship fast. It allows you to go without research - just accept the marketing, and pay up.

You pay above-cost (compared to on-prem) when you grow to a certain size. But this is usually a worthy trade off tbh.


It's a good way to bypass legacy IT teams. An on-prem server will have a bunch of snake oil endpoint protection products running on the box and bespoke config changes, and will take a couple of months to get up and running. An ECS container is up and running in minutes and has whatever you shipped in the container, without all the commentary from the peanut gallery.


A poor DX can happen in the cloud as well: waiting weeks for an IAM configuration to be sorted out or a security group to be opened.


I've run into this kind of snake oil in the cloud, as well. Some organizations demand all routing go through an "upstream" VPC operated by the parent org's IT dept, so various third-party security services (WAFs, IDSes, etc.) can scan / inspect the traffic.


When I moved reddit from on-prem to the cloud, I cut our costs by 27%. It was the same number of people managing both infrastructures, but once we moved to the cloud, I no longer spent most of my time imaging machines and driving them to a datacenter to rack and stack them. Instead I spent my time coding ways to manage machines via API, so that when I needed to double our infrastructure, I just ran a script.

Cloud makes a ton of sense for high growth companies. If your infrastructure is mostly static, that's when it makes sense to go to a datacenter. Or if you have one very specific use case, like Dropbox.


This is very unusual in my experience, except in cases where the on-prem was either massively over-provisioned or the customer was getting absolutely gouged on pricing at the DC or IP transit or whatever.

Do you recall what the big savings were in? Compute, storage, DB etc? Be really interested to understand why this case is so different to many I've read.


Funny enough I have the exact numbers handy! Keep in mind this was 2008 and we only had 150M page views a month back then.

Data Center (per month)

Servers: $6K

Cabinet (x3): $15K

Bandwidth: $2.5K

Support: N/A

Total: $23.5K

EC2 (per month)

Servers: $13K

Storage: $1.5K

Bandwidth: $1.1K

Support: $1.2K

Total: $16.8K
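As a quick check of the line items above, this short sketch re-adds both columns and computes the relative saving. The figures are copied from the comment; nothing else is assumed.

```python
# Monthly figures in thousands of USD, copied from the comment above.
datacenter = {"servers": 6.0, "cabinets": 15.0, "bandwidth": 2.5}
ec2 = {"servers": 13.0, "storage": 1.5, "bandwidth": 1.1, "support": 1.2}

dc_total = sum(datacenter.values())
ec2_total = sum(ec2.values())
saving_pct = (dc_total - ec2_total) / dc_total * 100

print(f"datacenter: ${dc_total:.1f}K, EC2: ${ec2_total:.1f}K, saving: {saving_pct:.1f}%")
```

Both totals match the ones listed, and the saving works out to roughly 28.5%, in the same range as the 27% quoted earlier in the thread. The small gap may come from costs not itemized here.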


Interesting, thanks - I assume on EC2 you were running your own database servers? (Just based on the fact that you don't mention RDS.) This would keep costs down significantly in most cases although I assume they get transferred into staff costs if you've got people maintaining them to get the same level of support for things like backup, replication, etc that you get for "free" in RDS...?


Yes, we ran Postgres on EC2. I administered it myself both in AWS and in the datacenter. RDS didn't exist at the time. :)

At the time we and Heroku were running the two largest Postgres clusters on EC2.


There are hundreds of factors for each company and application. Sometimes it works and sometimes it doesn’t, but it’s rarely as simple as retail prices.

Even the same number of "technical specialists" doesn't mean it's the same if one option lets you move faster or remain more reliable.


This is precisely my experience and what I observe in companies that fall into the cloud trap. In addition, it is important not to overlook the significant transformation and upskilling process required, as well as the time needed to accomplish it.


Datacenters are expensive to run, for many reasons. That is the appeal of the Cloud right there, before we even start talking about the people aspect.


I challenge any in-house team to achieve the same reliability as S3.

To be fair most businesses probably don't need it, but it is worth taking into account.


The same reliability at what scale? This is a major difference. S3 has amazing reliability but is an incredibly complex system with many moving parts, and that complexity has its own cost in terms of reliability. The simpler the system, the more reliable it usually is. Most businesses do not operate at even remotely the same scale as Amazon, and running a small-scale cluster with S3-like functionality wouldn't be that hard.


S3 is reliable because data is not stored in a single place. That reduces the risk from a water leak or some vandals ripping your storage rack apart.

However, as someone put it, put your money where your mouth is. Become a contractor offering to reduce storage costs in exchange for 10% of the savings. You should be a millionaire in no time, according to your narrative.


Amazon S3 standard storage offers the following features: Backed with the Amazon S3 Service Level Agreement. Designed to provide 99.999999999% durability and 99.99% availability of objects over a given year.
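To make those percentages concrete, here is a small sketch that converts them into expected object loss and a yearly downtime budget. The bucket size of 10 million objects is a hypothetical value chosen for illustration, and the calculation treats the durability figure as an independent per-object annual probability, which is a simplification.

```python
DURABILITY = 0.99999999999  # eleven nines, design target per year
AVAILABILITY = 0.9999       # four nines per year

MINUTES_PER_YEAR = 365 * 24 * 60


def expected_objects_lost_per_year(object_count: int, durability: float) -> float:
    """Expected number of objects lost in a year at the given durability."""
    return object_count * (1 - durability)


def allowed_downtime_minutes(availability: float) -> float:
    """Minutes per year of unavailability permitted by the availability target."""
    return MINUTES_PER_YEAR * (1 - availability)


objects = 10_000_000  # hypothetical bucket size
lost = expected_objects_lost_per_year(objects, DURABILITY)
print(f"expected loss: {lost:.6f} objects/year (one object every {1 / lost:,.0f} years)")
print(f"downtime budget: {allowed_downtime_minutes(AVAILABILITY):.1f} minutes/year")
```

The two numbers measure different things: durability is about never losing the object, availability is about being able to read it right now. An in-house system can reasonably target one without matching the other.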


> That’s rubbish. Cloud operations need the same or more headcount of technical specialists, they’re just doing different things.

It's a capex vs opex question. For financial quackery reasons, accountants and stock markets prefer having fewer people on the direct payroll, which is why everything not part of the "core business" is outsourced, even if it is more expensive in either the short or long term.


> And then you have to factor in things like the cost of downtime (on both sides of the equation, especially if you use us-east-1)

This is an important factor to remember when evaluating cloud costs. If you want to survive a cloud outage, you need a multi-AZ or multi-region deployment, and that costs developer hours. You also need to deal with the cost of inter-AZ or inter-region traffic, which can be catastrophic and/or cost more developer hours to mitigate.


> If you want to survive a cloud outage, you need a multi-AZ or multi-region deployment, and that costs developer hours.

RDS databases are one click to set up in multi-AZ, and if your stack is Kubernetes based or at least EC2 autoscale capable it isn't much more work to make it multi-AZ as well.

Multi-region deployments, however, are indeed expensive and nasty to set up.


usually you're right but the cost mentioned here is specific to BLOB data storage ... a very specific problem that can be isolated and run on external services and save the company lots of money.


^^^

The cloud vs. on-prem argument often seems to ignore the (enormous) middle ground. Just because one portion of your architecture would do well to be run outside the cloud doesn't mean you take out all the other parts you don't want to deal with yourself. Furthermore, "on-prem" might mean in your building, in someone else's building where you rent co-located space and control it, or in someone else's building where they deal with most of the hardware, etc.

That said, maybe Canva has considered a non-AWS solution and decided against it. Or maybe they've gotten certain better deals from AWS. We can't really know for sure on the outside.


My team uses AWS for _everything_. There are three reasons for this on our team (and I've researched it for our use cases).

- Consistency. Instead of saying "Oh look in AWS for this app, Azure for this one, and hetzner for this app, except its test env is in AWS", it all just lives in AWS. It massively simplifies docs and onboarding, and reduces the amount of one-person specialised knowledge.

- Engineering Costs. Similar to above, but in terms of engineering, there's less to know and understand. Instead of needing to know how the AWS load balancer routes/connects to a VM somewhere else, and how that VM gets its blob-storage-data from azure, we only need to understand AWS concepts.

- Vendor Lock In. Yeah, it's there. If we have a service that uses data from S3, there's egress costs from S3 to <other provider>, but not with EC2. We've consciously accepted this lock in for the time being.

Now, we're a 50 person company so YMMV, but the above tradeoffs plus an "opinionated" setup in AWS (everything on ECS, logging to Cloudwatch, RDS for DB) drastically reduced the "ops" overhead on our side after the initial setup. If I started over, I'd make the same decisions again.


> - Vendor Lock In. Yeah, it's there. If we have a service that uses data from S3, there's egress costs from S3 to <other provider>, but not with EC2. We've consciously accepted this lock in for the time being.

This is where I think the FTC should take action.

To the extent that this issue is a mutually agreeable arrangement between you and Amazon, it seems obnoxious but does not seem like it rises to the level where regulators should take action. But it affects third parties too: specifically, it prevents non-AWS-hosted vendors from effectively marketing their services to you. In that regard, I think the FTC should try to put a stop to this. AWS should not be permitted to effectively subsidize its and its partners’ services over outside competitors.

(And the US Government should never have accepted cloud deals with excessive egress costs. Part of the bidding process should have been a requirement for networking outside the winning provider to be priced competitively with internal networking)


But you're missing the point above. S3 is probably the easiest service to replace - there are loads of providers which use the _exact_ same protocol as S3. It's a drop in replacement. It literally uses AWS concepts, there is nothing else to learn apart from putting a different url into your application.

My view is that very few people should really be using S3 at any serious scale. The cost savings can be absolutely enormous for very little or no additional complexity, given how many providers are compatible with S3 and the fairly 'boring' nature of S3 compared to other technologies. Cloudflare, for example, also replicates your data a lot closer to users for no extra cost, significantly improving performance.


People really have to be afraid of running a VPS or whatever if they can't spin up a min.io instance to have their own S3 without dealing with Amazon's bullshit


How big of a production deployment have you run on min.io in terms of data transfer / month and total storage?

Because I've done small scale and can tell you I'd run S3 in the future.


Point taken, as I haven't used it in prod properly. I'll cross that bridge when I come to it, but I'd rather keep my money and put some time into making it work. There's also Cloudflare R2 now if S3 is too expensive but self-hosting is out of the question.


> My team uses AWS for _everything_...If I started over, I'd make the same decisions again.

Okay? I never said that sticking to a single cloud provider isn't appropriate for some (or maybe even most) people. It's good that you have a setup that you believe works well for you.


or just consider different providers

and easily lower costs by 50% depending on balance storage/bandwidth


Nitpick: It's "you can _check out_ anytime you want but you can never leave"; the dissociation between checking out and actually leaving is what makes the line. Still a good reference for the S3 situation!


Yup, we just moved large chunks of our SaaS platform from the cloud to dedicated servers with Kubernetes serving as an internal cloud. Ended up reducing our cost to 30% (70% saved!).

We realized that instead of using scaling for peaks, by having additional dedicated servers available full time - it still worked out significantly cheaper. We also moved a lot of our internal processing to run in specific windows of time where the expected load was low to maximize server utilization in those periods.


Sorry what’s an “internal cloud”? Please tell me we haven’t renamed on-prem.


The term "private cloud" has existed for easily 15 years, pretty much straight after Google coined the term "cloud computing". The current definition of "on-prem" actually changed after cloud computing reached mass adoption. Prior to then, on-prem referred to "in your office" vs. "in a datacenter". These days "on-prem" means "not in public cloud".


Yes, it's basically the same as "on prem", but instead of just a bunch of individual (virtual) servers, they are managed by a system that provides typical cloud APIs, such as OpenStack.


It will mean compute resources that can be transparently shared amongst different things (which is where the K8s comes in) as opposed to boxes dedicated to a single service.


on-prem services but running with the tooling that is typically used for cloud computing rather than bare-metal on-prem of old.


Why do they need to get everything out? Some not-insignificant percentage of the content is surely archived/unused/unneeded assets. From customers who will never log in again or projects which are long over. Export the most recent 10% and put anything new on <new storage>


Absolutely agree. There are more nuanced strategies to migrate off a cloud provider. You could even enable a soft export for idle accounts: leave the data where it lies and move it when the customer re-engages.


I would think you could set up a migration process where existing things are loaded off of S3, and as soon as you make any changes, you save it to $NEW_STORAGE and delete it off of S3. Eliminates the need to egress anything, lowers your S3 bill over time, and as a user it'd be pretty hard to log in after a year and be upset that your FB cover image is no longer there, especially on the free tier.
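The write-path variant described above might look roughly like this. The `load` and `save` helpers are hypothetical names, and dictionaries stand in for S3 and `$NEW_STORAGE`; a real version would also need to handle concurrent writers and failed deletes.

```python
def load(key: str, old_store: dict, new_store: dict) -> bytes:
    """Read path: prefer the new store, fall back to the old one."""
    return new_store[key] if key in new_store else old_store[key]


def save(key: str, data: bytes, old_store: dict, new_store: dict) -> None:
    """Write path: store in the new location only, then drop the old copy."""
    new_store[key] = data
    old_store.pop(key, None)  # deleting shrinks the old bucket without copying data out


old_bucket = {"design-1": b"v1", "design-2": b"v1"}  # stand-in for S3
new_bucket: dict = {}                                 # stand-in for $NEW_STORAGE

save("design-1", b"v2", old_bucket, new_bucket)       # user edits design-1
print(load("design-1", old_bucket, new_bucket))       # b'v2', served from the new store
print(load("design-2", old_bucket, new_bucket))       # b'v1', still served from the old store
print(sorted(old_bucket))                             # ['design-2']
```

Untouched objects keep being read from the old bucket, so this shrinks the storage bill over time rather than removing the old provider outright.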


Many aws services are hit or miss but s3 is just a marvel. Running geo-distributed 100s PB Ceph cluster with tiered storage isn't exactly trivial.


Millions of dollars per year isn't exactly a trivial amount of money either… Nobody said it's some super easy thing one could do in their free time, but given the scale it makes sense to build the knowledge in-house instead of getting vendor-locked-in at the most expensive service provider on the market.


I'd argue this isn’t true until your spend surpasses $100MM. Lower amounts don’t justify the extra headache of on-prem.


I know we're just arguing about amounts at this point, but that's pushing $2M a week. It seems on-prem would make financial sense long before that.


Depends on your requirements. If you need 11 nines durability S3 is probably still cheaper


There's also the hybrids. Only storing "unlikely to be accessed" data on S3 would likely remove a huge part of the AWS bill, while significantly limiting the size of the on-prem data set.

Of all the possible related problems, object storage is the easiest.


I really hope that Canva's executives don't believe their product needs such guarantees…


> At list price of 9 cents per GB it would cost Canva about USD$20 million to export their 230 petabytes. Let me know if my calculation is wrong.

If you're doing a one-time export you can use a Snowball, which would cost about $70K for 230PB, assuming you had somewhere to load it off to once you got the data so you could give the device back.


It's hard to imagine why they would ever need to get everything out. That would be extremely wasteful. If they wanted to move off of S3, they'd rely on the same type of analysis they present in this post + perhaps creating a mechanism whereby they permanently move an asset off of S3 if/when accessed.


> It's hard to imagine why they would ever need to get everything out. That would be extremely wasteful.

See the Docker x Cloudflare case study where improving cache-hit ratio by 2% (by moving to R2) decreased S3 egress fee by 66%: https://www.cloudflare.com/en/case-studies/docker/
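A 2-point improvement producing a 66% drop sounds surprising until you look at misses rather than hits, since only cache misses reach the origin. The baseline hit ratios below are hypothetical values picked to illustrate the effect; the case study's actual ratios are not quoted in this thread.

```python
def egress_reduction(hit_before: float, hit_after: float) -> float:
    """Fractional drop in origin egress when the cache-hit ratio improves.

    Origin egress is proportional to the miss ratio (1 - hit ratio), assuming
    request volume and object sizes stay the same.
    """
    miss_before = 1 - hit_before
    miss_after = 1 - hit_after
    return (miss_before - miss_after) / miss_before


# Hypothetical high baseline: 97% -> 99% cuts misses from 3% to 1%.
print(f"{egress_reduction(0.97, 0.99):.1%}")
# The same 2-point gain from a lower baseline: 80% -> 82%.
print(f"{egress_reduction(0.80, 0.82):.1%}")
```

The closer the hit ratio already is to 100%, the more each extra point matters, which is consistent with a small hit-ratio gain translating into a large egress saving.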


I'd go for a hybrid approach: new data is stored on private cloud, and old data would be gradually migrated or phased out.


Most businesses treat cloud like a utility, not tech. Don't want to generate their own power, deal with sewage ... and compute.

Which in 99.9% of the cases is the right decision. Sorry for all you infra techies, but that's the way tech matures.


Infra techies are fine. Likewise, the 'grid is mature', and yet most buildings with floor space over NN m2 have a team of electricians, whether direct hires or contractors. Same for the 'cloud': 90%+ of the world's deployed 'compute' is not cloud.


Also, infrastructure fundamentals do not change, no matter where they are running.

As long as devs don't seem to understand real-world limitations (for instance, a network cannot be instantaneous, and latency is not consistent) and thus leave tons of performance on the table, I think the "infra guys" will be just fine.

Also, don't forget there is an entire world out there of people making sure connections to the cloud are even possible across the internet.


Work for a 30bn market cap company, all our stuff is built on AWS. Offices all around the globe. Not a single electrician on payroll.

YMMV.


> Not a single electrician on payroll.

Directly, yes maybe. Many companies don't have in-house experts (and get shafted by vendors as a result, as they're lacking the competence required to assess the quality of work).

But indirectly, they all have. In Germany, at least, you're required to have electrical appliances inspected by a licensed professional (an electrician or otherwise qualified person) at least every two years - most companies opt for using dedicated external companies, but you can also train someone for that task. On top of that come all the electricians hired by the landlords - the ones dealing with complaints and remodels, or servicing lifts, escalators, HVAC, datacenters, ...


Utility is just an old term for "as a service".

Electricity as a Service, which includes payments for expert maintenance.

Thread is about AWS and merits of in-sourcing, which is not how the world operates.

Anyways.


You may call your company arrangement `electricityless` even! Like `serverless` - we don't use servers.


If you still make a profit and grow with public cloud, and you would have had to risk time-to-market, reliability, and man-years to make something on your own, then it's not that crazy.

More of a "good problem to have" (and you can solve it when you need to).


The real story isn't how expensive it is to leave, it's how easy AWS makes it to spend. It used to be that you'd see how many servers a project needs and have to beg someone for them, and then only get half of what you wanted. With AWS, it's much easier to launch something new, but there isn't the same amount of scrutiny on spend.


“Nobody ever got fired for choosing I̵B̶M̶ Amazon”


Cloudflare could build a product that not only caches, but transfers S3 to R2 on usage.

- Since non transactional data is likely to be on GitHub or somewhere, it should be easy to re-hydrate that on R2.

- Once enough time passes, there could be a sunset on unused data, or a planned one-time exit that writes off the S3 assets that never made it to R2.

Standard migration procedure.


What happens when cloudflare changes the price?


Let me give a hypothetical. Let's say Canva decides to cut costs by cutting the amount of storage users can have (e.g. a quota, or a lifetime):

- if they had a sunk cost CAPEX of 100 PB of storage, all they could do is sell the storage servers for a fraction of the original cost (and folks like me snapping up pre-owned server equipment for said fraction of cost for our home labs).

- on the other hand, since S3 is OPEX (not sure if they are locked in to 1 year billing cycles? but still better than 3-5 year CAPEX runs), they could cut the cost of that storage "immediately".

Similarly, if they needed to expand, CAPEX means placing an order for the storage, waiting for delivery, racking, burning in etc, versus just paying almost "instantly" for more storage.

So long as they are still earning money, why wouldn't they stick to this OPEX model?

Not saying this is good or bad, it depends on your business model, ultimately.


> So long as they are still earning money, why wouldn't they stick to this OPEX model?

The suggestion was to move to a different / cheaper provider. You can stick to OPEX.


> Cloudflare R2 charges zero egress fees.

The scarier part is Canva actually does use Cloudflare CDN, so it's not so much the cost of adding a new provider. It's just that the CDN team that manages the Cloudflare account is likely a different team :)

Too much isolation, silos and politics...


The much more likely answer is that R2 is too new compared to when they started (announced in 2021) and switching from S3 to it is probably not trivial at Canva's scale.


“we store more than 230 Petabytes of data in Amazon S3, with our single largest S3 bucket coming in at a whopping 45 Petabytes”

How did they get so much data?! They say they only have 75 million stock photos. They must have tens of petabytes that are just dark.
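A rough sanity check supports the suspicion. If all 230 PB were attributed to the 75 million stock photos, each photo would have to be implausibly large. Decimal units are assumed.

```python
TOTAL_BYTES = 230 * 10**15   # 230 PB, decimal
STOCK_PHOTOS = 75_000_000

bytes_per_photo = TOTAL_BYTES / STOCK_PHOTOS
print(f"{bytes_per_photo / 10**9:.2f} GB per stock photo")
```

That is about 3 GB per photo, so the bulk of the data must be something other than the stock library, presumably user uploads and designs.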


And now it's VC-fashionable to have S3 as "object store" for "serverless" databases. "offload to S3!"


They could optimize this by only egressing the most recently used petabytes of data for some value of “recently”.


what's the durability of R2?


How would you measure that realistically? How durable is Backblaze or Azure?

Cloudflare has been reliable on storage adjacent uses (caching) for a long time. Presumably you're more likely to be the cause of errors than R2 is by a few orders of magnitude.


well you don't measure it, you take it from their sla


What I mean is that if you have a claim for eleven 9's of durability for a product that's 3 years old you have to take their word for it.

You can't really check the SLA as being realistic


eleven 9's apparently


Do you know how much it’ll cost to store 230 petabytes per month on premises at the same redundancy and availability level? I have done such calculations for academia at a smaller scale and unless you literally want it to last decades it’s definitely not worth it to do your own infra on this.


I keep seeing that argument, but you, like the others, don't provide the figures you claim you got from your calculation.

Also, it seems that it's 230PB in total, not per month, which fits in an apartment-sized server room (you want to have several such places, but it's not that big).


Wasabi has them. Scaling to 230PB would cost more than their 1PB calculation. Taking their 5-year total, multiplying by 230, dividing by 5, then dividing by 12 gives a monthly total of $5,021,666. It's also 131 server racks according to Wasabi. At 131 racks of storage you likely now need racks dedicated to networking, since a top-of-rack switch wouldn't be enough. With space for the racks it's about 2,227 sq ft of space.

https://wasabi.com/blog/on-premises-vs-cloud-storage/
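The per-PB figure behind that monthly total is not quoted in the comment, but it can be backed out by inverting the stated steps. This sketch uses only the numbers given above; the derived values are implied by them, not taken from the linked article.

```python
PETABYTES = 230
MONTHLY_TOTAL = 5_021_666  # monthly figure quoted in the comment
RACKS = 131
FLOOR_SQFT = 2_227

# Invert "5-year total x 230 / 5 / 12" to recover the per-PB five-year figure.
five_year_per_pb = MONTHLY_TOTAL * 12 * 5 / PETABYTES
monthly_per_pb = MONTHLY_TOTAL / PETABYTES
monthly_per_gb = monthly_per_pb / 1_000_000  # decimal GB per PB
pb_per_rack = PETABYTES / RACKS
sqft_per_rack = FLOOR_SQFT / RACKS

print(f"implied 5-year cost per PB: ${five_year_per_pb:,.0f}")
print(f"implied monthly cost:       ${monthly_per_pb:,.0f}/PB (${monthly_per_gb:.4f}/GB)")
print(f"density: {pb_per_rack:.2f} PB/rack, footprint: {sqft_per_rack:.1f} sq ft/rack")
```

Under these numbers the on-prem estimate comes to roughly $1.31M per PB over five years, or about 2.2 cents per GB per month, with under 2 PB stored per rack.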


This is a marketing pitch for a cloud storage company and it shows:

> - System hardware: $500,000

For one PB? Gold plated HDDs and cases I guess…

> Assume 15% annual system maintenance over five years

Do not run your hard drive cluster on the same floor as vibration-inducing machines, folks. /s

According to Backblaze, even an 8-year-old hard drive is still twice as reliable as this, and a 15% failure rate is what to expect over 6 years of consecutive use!

> Assume the space needed to store 1PB is between 25%-50% of a standard rack (42U)

You can fit almost 10PB in a rack though, so that's another figure that they've inflated.

> Assume personnel costs to manage a 1PB system are 0.5 IT/storage admin FTE

This one is probably true (and it's even an underestimation) if you have only 1PB of data, but it's not going to scale linearly: you're never going to need 130 FTE for 260PB.

Overall, if a supplier (who's going to use the same kind of off-the-shelf tech as you would) claims that his price is 4 times cheaper than what it would cost you to run it yourself, he's probably just lying to sell you stuff, just saying…
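The rack and staffing assumptions disputed above are easy to compare numerically. This sketch uses only figures quoted in the comment (25%-50% of a rack per PB, almost 10 PB per rack, 0.5 FTE per PB) and ignores redundancy and networking racks.

```python
PETABYTES = 230


def racks_needed(petabytes: float, pb_per_rack: float) -> float:
    """Racks required at a given storage density, ignoring redundancy overhead."""
    return petabytes / pb_per_rack


# Article's assumption: 1 PB takes 25%-50% of a rack, i.e. 2-4 PB per rack.
print(racks_needed(PETABYTES, 4), racks_needed(PETABYTES, 2))  # 57.5 to 115 racks
# Commenter's assumption: almost 10 PB per rack.
print(racks_needed(PETABYTES, 10))                             # 23 racks

# Linear staffing at 0.5 FTE per PB, which the comment argues will not hold at scale.
FTE_PER_PB = 0.5
print(FTE_PER_PB * 260, FTE_PER_PB * PETABYTES)                # 130 and 115 FTE
```

The rack count varies by a factor of five depending on which density you believe, and the linear staffing figure shows why extrapolating a 1 PB cost model to hundreds of petabytes is questionable.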


Can you fit 10PB in a rack? Sure, if you want it all flash. You also need double that for mirroring, or quadruple that for RAID 10. Double that again if you want a mirrored copy of that array too. If you don't have high performance needs you can get by with slower hard drives. You also have the 25Gb, 40Gb, or 100Gb top-of-rack switch, and you want two of those for redundancy. It quickly starts to add up. Annual system maintenance could be lower but is still a cost. Number 1 reason I'm in a datacenter is to swap a dead drive.

Oct 2020, StorONE announced a 1PB All-Flash array for $499k.

Other examples which are close to their pricing. https://www.reddit.com/r/storage/comments/vdr6ql/pricing_exa...


> Number 1 reason I'm in a datacenter is to swap a dead drive.

No doubt about that. When you have tens of thousands of drives you end up replacing some of them all the time, but not 15% of them every year…

> Oct 2020, StorONE announced a 1PB All-Flash array for $499k.

With Optane and all, right? The kind of thing that's massive overkill if your workload allows you to use Glacier…

> Other examples which are close to their pricing.

In that list, the only offers that are close to this pricing (but still almost 20% cheaper) also include several years of support (which you probably don't want at Canva's scale).


> Oct 2020, StorONE announced a 1PB All-Flash array for $499k.

mind you the price of flash storage has dropped significantly since 2020.

Hard disks are still being used, but having all flash systems is not uncommon.


140 racks is quite a lot of tonnage of AC as well. And probably hot and cold rows to make it function.


Hot/cold was factored in, but not the AC or power needs. I don't know many places where you could just casually ask for 100+ racks. Maybe our midwest datacenters are just smaller than what is on the coast.


The last place I worked that had that kind of capacity, I had access to the lobby of that building and that was about it. The server room I could access had space for <wiggles fingers> maybe 15-20 racks and in fact had maybe 7, which is probably why it was always cold as fuck in there and I didn’t like being in there much.

Then there was a contract which had a server room twice the size and mostly contained an AS/3<mumble> and a couple racks of Intel hardware. You could probably get a ping pong table into that one. Giant rooms that are mostly white and too cold are a little freaky, and reminiscent of 2001.

There have been a number of stories of people not being able to utilize their floor space for more servers because they ran out of space on the roof for AC units. At one point I recall Google was working on power dissipation because they had hit the max electrical code, and so no one would agree to run more power into their buildings. More servers meant more compute per watt and per ton of chillers.


AWS has a talk about things like this. They used to (no idea if they still do) set up an AZ based on a MWh figure for best efficiency, because any higher and the power company doesn't like it. At our local DC, the water cooling and refrigerant cooling plant is larger than the 5 generators they have. I've crammed a lot of servers in my rack and I'm out of power plugs, but still under 2 kWh of what we are on contract for. Dense servers have really messed with datacenters.


> I don't know many places you could just casually ask for 100+ racks.

That's obviously not something you can do on a whim, but at the same time we're talking about a seven-figure investment plus the time to build the in-house team to manage this, so it's not going to happen overnight anyway.


A lot of people in this thread should start consultancies that transition companies off the cloud.

Apparently, it would be easy for them to save tens of millions, in turn making millions for the consultant.


Couldn't agree more.

One only says this if they don't have a good understanding of the differences between capex & opex, the amount of money spent on tech labor & compliance projects, as well as the level of assurances AWS gives you regarding durability of data. All of this in 2023, when we have almost 20 years of data showing more and more businesses moving to the cloud and staying in the cloud.

I think anti-cloud sentiment aligns with hacker mentality on decentralization=good (which it is), anti big corporation feelings (boo amazon) and so it leads people in these threads to make emotional arguments over something that is clearly going in the other direction. It's an excellent example of confirmation bias, trying to look for any and all justification to say the cloud is worse, when the majority of businesses have decided otherwise.


Have you even looked at S3 bandwidth pricing? I don't get how anyone using S3 for storing media or binaries can justify that outrageous price unless every download is connected to something making you money.


Well, step 1, have petabytes of data in S3, then don't pay the sticker price.

Step 2, write articles like this as part of your newly contracted "engagement" with Amazon to help justify the lower price.


>step 1, have petabytes of data in S3, then don't pay the sticker price.

So pay Amazon a bunch for the privilege to have lower prices. How about just starting off somewhere where you don't have to commit to paying a lot of money to get a reasonable price?


Because when you start out you don't know you'll need petabytes of storage.

Canva grew very quickly. I don't think it was a terrible call to use the cloud at the time.


Cloud isn't the issue. Using overpriced services is the issue. Regardless of size they shouldn't have used S3.


It's just a part of natural cycle.

Cloud became a fad a while ago, companies like Amazon/Google/MSFT rushed in burning cash to provide incentives. Now said companies adjust fees to milk businesses due to high exit cost, just because they can. This will inevitably create outflow of customers, however we are far from this point currently.


I do. Interested? :D On a small project, I managed to reduce costs by 70 times simply by transitioning and eliminating unnecessary features. This happened four years ago, and the project is still running smoothly. However, what I truly enjoy is providing an objective view of the actual cloud transition process to companies before they make any misguided decisions.


Just a commentary that if you could save a bunch of companies 70x then that is the easiest sell in the world. I WISH I could do something like that because I would be a billionaire haha.


Consultants who help you move to the cloud are not cheap either. And migration to AWS/Azure is still more common than migration out.


Pretty much the only go-backs that I have ever seen were when someone tried to forklift an onprem application that was never meant to run on a computer whose cpu cycles are metered. One of the most striking things to me in ops:dev relations was the way the conversation around performance and efficiency changed once we started paying for it by the cycle.


There is a lot of good money to be made for people who understand cloud cost structures and the alternative options available. You'll want to target mid-size businesses that bet on a cloud and were actually successful but are now drowning in cloud bills. Large companies will hire in-house expertise, and small companies don't spend enough for it to be worth hiring a contractor. But I really think there's a market of mid-size companies who don't know how to reduce their AWS bills. Cloud cost optimization (and perhaps migrations between clouds or to hybrid cloud/on-prem at times) would be a great business for a solo contractor. Being able to pay for your own contracting fee out of the savings is almost a given.


They could do it all in a weekend, easy.


The thing about economics is that it's physics, with a time delay. We are seeing the time-delayed reaction to over-extension into the cloud play out, not instantly but in the downturn cycle.


So do you think server hardware companies will do well during a downturn?


I'm really surprised how much anti cloud opinion is being voiced here (but maybe I should have known better, after all these years on HN).

Even if you don't use any other cloud tech in your stack, my advice is, use S3 (or equivalent)! That level of durability, availability, and scalability, is not trivial to pull off, and is best made somebody else's problem. Even when you're as big as Canva.


I'm a big fan of Garage[1], which is a dead-simple S3 drop-in that you can host on your own drives. It's designed for consumer hardware with shitty internet in-between nodes.

[1]:https://garagehq.deuxfleurs.fr/


Thank you, that's amazing! While there isn't an extensive comparison available, I have come across a small benchmark that compares a few features. You can find it here: https://garagehq.deuxfleurs.fr/blog/2022-perf/


All the marketing pages in the world mean nothing until they can show who currently uses them and at what scale.


Well, someone has to make that experiment and see if it’s right for them, right? It’s only a few years old, and most bigger companies are already sold on whatever they’re using. Smaller companies you’ve never heard of will be the first drivers, most likely.

FWIW, I have an EU region cluster (3 nodes per inner-region in the North, West and South). It’s all one logical region to S3 libraries and holds nearly half a terabyte.

It’s probably one of the most stable parts of my infrastructure. I don’t know (or need to know) what the replication lag is between regions, but IIRC, there’s some configuration of consistency.


Whoa! This is the first time I have seen deuxfleurs. Do French people use this over GitHub?


I believe deuxfleurs is a programming co-op. The UI is called Gitea IIRC, an open source GitHub clone.


This particular instance appears to be running Forgejo, a soft-fork of Gitea.


If you want trouble and exorbitant fees, the cloud is for you...


Something which is often overlooked is that transitioning objects in S3 to another storage class using lifecycle policies is a one-way operation. If you later notice that the chosen storage class doesn't fit or your use case changes, you can only go "forward" using lifecycle policies and not "back" to e.g. "S3 Standard" [1]. Going "back" is therefore a quite costly and cumbersome operation, as you have to apply the change of the storage class to each object individually.

[1]: https://docs.aws.amazon.com/AmazonS3/latest/userguide/lifecy...
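
As a rough sketch of what going "back" per object can look like: the usual approach is to copy each object onto itself with a new storage class. The function below takes any client that behaves like the one returned by boto3's `boto3.client("s3")`; the function name, bucket, and keys are hypothetical. It assumes objects under 5 GB (larger ones need a multipart copy) and objects that are directly readable (anything in Glacier Flexible Retrieval or Deep Archive would have to be restored first).

```python
def move_back_to_standard(s3_client, bucket, keys):
    """Rewrite each object in place with StorageClass=STANDARD.

    s3_client is expected to behave like boto3.client("s3").
    Every copy is a separately billed request, which is why this
    gets costly for buckets holding billions of objects.
    """
    moved = []
    for key in keys:
        s3_client.copy_object(
            Bucket=bucket,
            Key=key,
            CopySource={"Bucket": bucket, "Key": key},
            StorageClass="STANDARD",  # the class we want to return to
        )
        moved.append(key)
    return moved
```

In practice you would feed this from a bucket listing or an inventory report and run it in parallel, but the one-request-per-object shape stays the same.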


Canva is a case where AWS made sense for the founders. The founders made billions anyway, and who cares if you give a few hundred million to Jeff and have to hire expensive AWS consultants if you can avoid hiring a few ops people that actually know how to operate a server /s

More seriously, I think the sweet spot is to buy 2x or 3x what you expect to need but do it off the cloud.

At Canva's scale you'll spend hundreds of thousands, save literal millions and have way more headroom and stability than with crappy auctioned instances on AWS.


Apple uses AWS for most(?) of their cloud services. TikTok grew to the #1 spot entirely on a public cloud (Alibaba). I’m sure they should have hired “a few ops people that actually know how to operate a server” instead.


The anti-cloud sentiment on HN is as old as time; there will always be posters who can claim that they can build X with just dedicated machines on Hetzner for 1/10th of the cost. Yet AWS keeps growing and the people running these companies must be unable to do basic math. My account just crossed 11 years, and I think the whole cloud vs on-prem conversation is probably one of the least productive conversations on HN. If there was so much waste out there you would think someone would figure out how to just siphon all that waste directly into their pockets instead of Jeff's; and maybe they have, I don't know, but the posts are always the same, handwaving about how the markups are insane and the engineers at these companies aren't able to balance a checkbook.


Inertia, group think, marketing, FUD, and "no one ever got fired for buying X" is INCREDIBLY powerful.

At this point even the not-infrequent public cloud outages are essentially shrugged off with "Eh, we use $CLOUD, that's what everyone uses. What are you going to do?"

The ops team can actually say "It's Amazon, not our problem." while they sit around powerless to do anything about it.

Yes there is the "survive anything AWS related" geo-distributed multi-region approach that you pretty rarely even see because it itself requires an army of AWS experts and drastically increased cost to actually implement and operate while not shooting yourself in the foot left and right - actually decreasing reliability. Not to mention the uber-unicorn "multi cloud" which at least I've never seen or actually heard of in practice. These don't usually survive even an initial cost review.


> there will always be posters who can claim that they can build X with just dedicated machines on Hetzner for 1/10th of the cost

And some of them did just that, both known names and small shops. Big names like Cloudflare R2 or Backblaze B2.


Apple operates their own data centers. Apple uses GCP as part of Google's deal related to search.


At Canva's scale you'll spend hundreds of thousands a year, just hiring one of the engineers working on your home-grown solution. Multiply this by 50 or more just for the staff, before talking about hardware, cost of delivery speed, cost of bugs, etc.


Instead you can spend millions on paying top market dollar to AWS experts trying to work out your giant pile of spaghetti cloud infrastructure.

Desperately trying to work out if they can delete this resource or that because who the hell knows which bit of critical code is using it. Or spending days trying to manage a chewing gum ball of IAM policies.


Wouldn't you just be swapping those problems with similar problems if you switched to on prem?

> Desperately trying to work out if they can delete this resource or that because who the hell knows which bit of critical code is using it.

How is this unique to the cloud?


> Instead you can spend millions on paying top market dollar to AWS experts trying to work out your giant pile of spaghetti cloud infrastructure.

The point is that you will spend millions anyway. Might as well have something better than whatever home-grown dumpster fire your team will come up with.

You also seem very ignorant of how AWS works. I've never had to wonder if I could delete a resource or not, because all our resources are created via CloudFormation. Same for IAM policies.


Sounds like your experience is limited to badly managed CloudOps teams.


Articles like this are a great example of how you can be a hero for fixing a problem, but get no recognition for preventing one.


Surely they negotiated a private pricing agreement given the volume of data/usage...one would hope.

Having been a part of these negotiations for services other than S3, DTO discounts can be generous if you have a good forecast (3-5 years) on your data egress.


Offtopic: is it some cardinal sin to include a link to the home page from documentation? I don't know what Canva does; if I stumble on a documentation page first, should I not be allowed to see the product's landing page at some point? This is so prevalent, it has to be a conscious choice.


What? There are multiple homepage links on this page, including the Logo, Home from navigation menu, as well as the first word in the article is a link out to Canva's homepage.


The logo and "home" do not lead to the product page - they lead to developer docs home page.

Did not think to try the first link though.


Thank you for raising this! I'll take a look at how we can make the discovery more straightforward.


My favorite is when you click on some landing page made by a company.... and there is zero way to get to the rest of the site. Marketing-centric landing pages often completely strip out their global navigation for no reason.


Real question, how to scale a business to petabytes of data without cloud? How is it done in practice?


Buy hard drives. Typically the spinning ones for the data because $/GB is really cheap. Then you buy some SSDs for your actual applications to run on.

From there you install Hadoop, maybe a query tool like Spark, Impala, and Presto. Then you install some reporting tools like Apache SuperSet.

At least that's how we did it at my old job with Petabytes of data.


Quickly checking, Dell sells boxes that are advertised as 14 PB per rack and have an S3 API. You can probably do better by shopping around, but it gives a ballpark of what is available in the on-prem market.


If you don't mind it running in a data center in Europe, you could rent servers from Hetzner for ca. $400 per month for 220TB. I'm sure they can provide a few PB on the spot. Spread them across their locations and you have geo redundancy too. Next to no egress fees also allow you to keep a copy with another provider.

https://www.hetzner.com/dedicated-rootserver/matrix-sx
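
For a sense of scale, here is a back-of-the-envelope sketch that applies the figures quoted above (about $400 per month per 220 TB server) to the 230 PB mentioned at the top of the thread. The function name and the `copies` parameter are made up for illustration, and this counts raw disk only: no filesystem overhead, no staff, no software stack.

```python
import math

def rented_server_cost(total_tb, tb_per_server=220, usd_per_server=400, copies=1):
    """Return (server_count, monthly_usd) for storing total_tb `copies` times."""
    servers = math.ceil(total_tb * copies / tb_per_server)
    return servers, servers * usd_per_server

print(rented_server_cost(230_000))            # (1046, 418400)
print(rented_server_cost(230_000, copies=2))  # (2091, 836400)
```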


Have a look at the Backblaze blogs, where they detail how they built a storage provider on the cheap.


and be prepared to write your own software stack


10 Petabytes is 10,000 TB. That's only about 455 spinning disk drives (22 TB each). Approximate this as 600 with RAID 5. You can easily fit this in 100 desktops.

There's tons of OSS to manage all layers of this stuff now.
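
The arithmetic is easy to check. A minimal sketch, assuming RAID 5 groups of 4 drives (3 data + 1 parity); the group size is my assumption, not something stated above, and other group sizes give different totals:

```python
import math

def drives_needed(usable_tb, drive_tb, group_size=None):
    """Drives needed to provide usable_tb of capacity.

    group_size=None means no redundancy. Otherwise each RAID 5 group
    of group_size drives gives up one drive's worth of space to parity.
    """
    data_drives = math.ceil(usable_tb / drive_tb)
    if group_size is None:
        return data_drives
    groups = math.ceil(data_drives / (group_size - 1))
    return groups * group_size

print(drives_needed(10_000, 22))                # 455
print(drives_needed(10_000, 22, group_size=4))  # 608
```

So "roughly 600" holds up for small parity groups; wider groups need fewer drives but lengthen rebuilds.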


Slight tangent from the point, however: friends don't let friends use RAID5. Putting this out as a PSA reminder as it's not often discussed any more given the prevalence of cloud storage.

RAID5 gives you really the worst of all worlds. You get poor performance (n/4) due to the number of operations required and you get the fragility of a RAID storm that comes from a cascade RAID failure when a drive dies. A "stripe of mirrors" (RAID10) gives you the safest and the highest performance array (at scales where these decisions are being made which most people would see anyway). The common argument against is the cost (usableCapacity = totalCapacity/2) but my counter argument is that RAID5 is a timebomb with a secret countdown. RAID5 is great if you don't need the data on the array.

Source: >20 years running complex large arrays
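
To put numbers on the capacity trade-off, here is a small sketch using the textbook usable-capacity formulas for each layout. The function name and the `mirror_width` parameter are illustrative, and real arrays lose a bit more to spares and metadata.

```python
def usable_fraction(level, n_drives, mirror_width=2):
    """Fraction of raw capacity that is usable, for textbook RAID layouts."""
    if level == "raid5":
        return (n_drives - 1) / n_drives  # one drive's worth of parity
    if level == "raid6":
        return (n_drives - 2) / n_drives  # two drives' worth of parity
    if level == "raid10":
        return 1 / mirror_width           # every block stored mirror_width times
    raise ValueError(f"unknown level: {level}")

for level in ("raid5", "raid6", "raid10"):
    print(level, round(usable_fraction(level, 12), 3))
```

With 12 drives that is about 92% usable for RAID5, 83% for RAID6, and 50% for RAID10 (33% if each mirror has three drives).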


Or “software” RAID 10 with ZFS mirror vdevs all slapped into one pool?


Precisely that


What's your opinion on RAID6? Reasonable middle ground between cost and redundancy or added complexity with the same performance problems as RAID5?


With Ceph, and I imagine other similar solutions, you can skip RAID altogether and manage redundancy at a higher level over plain disks. RAID makes sense if you specifically want to have a conventional filesystem on top of it, but for object storage it's not necessary.


You're applying the same principles, just at the object level rather than the filesystem level. You still have to choose a redundancy pattern.

But I do agree with you, it does make the problem significantly easier/different.

ZFS and Ceph are marvelous creations.


RAID6 is passable for a low performance small scale bulk storage, but the performance hit is worse than RAID5 and the rebuild times are even longer.


RAID10 for 22TB disks with only two drives at the "1" level? I would expect no fewer than 3 drives at such capacity. Otherwise there is a huge chance of losing the second drive during 1. the disk swap (which takes time, possibly even days) and 2. replication.

What are your thoughts on implementing RAID 0 over four RAID 1 arrays, each consisting of three disks, resulting in a total of 12 drives?


Agreed, 22TB is a _long_ rebuild time. I'd either go 3 drives in the mirror in that scenario, or go for smaller spindles, depending on requirements. It's not always about the biggest spindles possible.
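
A quick lower bound on that rebuild time, assuming the replacement drive can be written at a sustained 200 MB/s. That rate is an assumption for illustration; a busy array will usually rebuild more slowly.

```python
def rebuild_hours(drive_tb, mb_per_s):
    """Hours to fill a whole drive at a sustained rate (decimal TB and MB)."""
    total_bytes = drive_tb * 1e12
    seconds = total_bytes / (mb_per_s * 1e6)
    return seconds / 3600

print(round(rebuild_hours(22, 200), 1))  # 30.6
```

That is well over a day with no redundancy left in a two-way mirror, before counting the time to notice the failure and physically swap the drive.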


But that's on a single RAID in a single location. You'd probably need at least one failover, and backups for that as well, if you'd like to target higher availability / durability.

S3 keeps multiple copies in multiple geographic locations.


When you are distributing over multiple geographic locations, and much of the data can tolerate a longer round trip time for access, you can extend the RAID pattern across multiple locations, instead of within a location. Provided you have enough network bandwidth for recovery, you don't need a full RAID at each individual site.

Having more locations is actually helpful for this. Because of the independence of different locations, the risk of data loss is in some ways lower than a local RAID system where all drives might be fried at the same time by a single event.


You can have a few petabytes in a single rack.

50-ish drives in a single 4U server is done.


What about single points of failure? Replication? High availability? Disaster recovery? Backup / snapshots/ versioning? This is one of the benefits of the cloud.


Half of these are already managed by something like ZFS (the only one I can think of which is not inherent in the filesystem is high availability). But one should build that kind of redundancy at a higher level anyway (multiple storage pools, replicating writes, etc).

or, if you go fibre channel, just replicate the writes across multiple storage systems and be done with it. (this requires a fibre channel network though, which is neither easy nor cheap).


Use a distributed file system or object store with what you need, like Quobyte or Ceph.


Plenty of tools for that on premises. Are you actually asking?


No, I'm just mentioning these things to show that the decision is more complex than a "4U server" vs "S3." S3/AWS gives you a ton of stuff you'd have to implement yourself. Plus you need to hire more people to manage it.


Minio does most of S3 for example, duh you would have to do some stuff yourself but you wouldn’t be implementing everything yourself from scratch, far from it.

https://min.io/


If you're at a point of spending $1.6M to transition objects from one class to another and spending millions per month in storage costs -- you need to have a real conversation about if storing your data in a vendor-locked cloud is the right path forward. S3 isn't the only option, MinIO + dense storage is one viable option if your spend is high enough to justify running MinIO. Backblaze is another.

List price of Backblaze:

$0.005/GB - Base Storage

$0.01/GB - Egress (Use a data partner to bring this down to $0)

$0 - PUT API Requests

$0.004/10,000 - GET API Requests

No delete penalties a/k/a minimum storage

List price of Glacier Instant Retrieval:

$0.004/GB - Base Storage

$0.09/GB - Egress

$0.02/1,000 - PUT API Requests

$0.01/1,000 - GET API Requests

Billed for 90 days of storage
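
To compare the two lists on a concrete workload, here is a small sketch that plugs the prices exactly as quoted above into a monthly estimate. The prices are not independently verified, storage is assumed to be per GB per month, the workload numbers are invented, and anything not in the lists (retrieval fees, the 90-day minimum, negotiated discounts) is ignored.

```python
# List prices as quoted in the comment above (USD); not independently verified.
PRICES = {
    "backblaze": {"storage_gb": 0.005, "egress_gb": 0.01,
                  "put": 0.0, "get": 0.004 / 10_000},
    "glacier_ir": {"storage_gb": 0.004, "egress_gb": 0.09,
                   "put": 0.02 / 1_000, "get": 0.01 / 1_000},
}

def monthly_cost(provider, stored_gb, egress_gb, puts, gets):
    """Estimated monthly bill in USD for one provider."""
    p = PRICES[provider]
    return (stored_gb * p["storage_gb"] + egress_gb * p["egress_gb"]
            + puts * p["put"] + gets * p["get"])

# Hypothetical workload: 1 PB stored, 50 TB egress, 10M PUTs, 100M GETs.
workload = dict(stored_gb=1_000_000, egress_gb=50_000,
                puts=10_000_000, gets=100_000_000)
for name in PRICES:
    print(name, round(monthly_cost(name, **workload), 2))
```

For this made-up workload the estimate is roughly $5,540 for Backblaze versus $9,700 for Glacier Instant Retrieval, with egress and request fees making up most of the gap.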


There is no way they need 230 PB saved. That's javascript-tier design in practice. If you can put almost all of it into glacial storage then that's because you don't actually need to store it.


> Canva now has users across 190 countries, who design in over 100 languages thanks to some amazing localization work done by our Internationalisation team.

How many of those countries are profitable? My sense is internationalization is a relatively fixed cost, but how does customer support in Burma, for example, not trump any realizable revenues there?


This is an interesting post...though I'm surprised to see this kind of optimization occur so late stage/as large as Canva.

There's quite a few anti-cloud opinions in the comments but for anyone who is hosted on S3 and would like this same level of automated analysis for cost savings on S3 (and other AWS services), we basically profile for this out-of-the-box on Vantage and it's free to get an optimization check to see savings: https://www.vantage.sh/

Not meant to be as shameless of a plug, but seems very relevant to the topic at hand for onlookers.


"I'm surprised to see this kind of optimization occur so late"

Every single day I read the trope on HN that "premature optimization is the root of all evil".

So, these guys literally followed HN's creed.


Canva is ten years old. They grew to 1 million users in two years. Doing this work 8 years ago wouldn't have been "premature optimization". More likely, this is happening now simply because the financial environment has changed and companies are now giving increased weight to projects that save money over those that add features.


Yeah no, R2 is a way better choice than being vendor locked like this when burning millions


> 100 million monthly active users

Oh, my. That's a lot. What does Figma have? 300?



