Does Amazon S3 really save money?
71 points by psaccounts on Jan 6, 2009 | 45 comments
With a price tag of $0.150/GB/month, storing 1TB of data costs around $150/month on Amazon S3. But this is a recurring amount. So, for the same amount of data it would cost $1800/year and $3600/2-years. And this doesn't even include the data transfer costs.

Consider the alternative: with colocation, the hardware cost of storing 1TB of data on two machines (for redundancy) would be around $1500/year. But this is fixed. And increasing the storage capacity on each machine can be done at the price of $0.1/GB. This means that RAID-1 plus redundant copies of data on multiple servers for 4TB of data could be achieved at $3000/year and $6000/2-years in a colocation facility, whereas on S3 the same would cost $7200/year and $14,400/2-years.
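The S3 side of this arithmetic is easy to check with a short sketch. It assumes 1 TB = 1,000 GB (which is what the $150/month figure implies), a flat $0.15/GB/month, and no transfer or request charges; the function name is made up for illustration.

```python
# Storage-only S3 cost at a flat rate, as used in the post above.
# Assumptions: 1 TB = 1,000 GB, flat $0.15/GB/month, no transfer or request fees.
S3_CENTS_PER_GB_MONTH = 15
GB_PER_TB = 1000


def s3_storage_cost_dollars(terabytes: int, months: int) -> int:
    """Return the flat-rate S3 storage cost in whole dollars."""
    cents = terabytes * GB_PER_TB * S3_CENTS_PER_GB_MONTH * months
    return cents // 100


if __name__ == "__main__":
    for tb in (1, 4):
        for months in (1, 12, 24):
            cost = s3_storage_cost_dollars(tb, months)
            print(f"{tb} TB for {months:>2} months: ${cost:,}")
```

This reproduces the $1800 and $3600 figures for 1TB and the $7200 and $14,400 figures for 4TB. The colocation side is not modeled here because the post does not break it down.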

Also, adding bandwidth+power+h/w replacement costs at a colocation facility would still keep the costs significantly lower than Amazon S3.

Given this math, what is the rationale behind going with Amazon S3? The Smugmug case study of 600TB of data stored on S3 seems misleading.

I do see several services that offer unlimited storage which is actually hosted on S3. For example, Smugmug, Carbonite etc. all offer unlimited storage for a fixed annual fee. Wouldn't this send the costs through the roof on Amazon S3?

If your startup is using Amazon S3 for its storage needs, for the benefit of the startup community, can you please elaborate on your rationale for choosing this service?

Hey there, I'm the CEO & Chief Geek at SmugMug. You're overlooking a few things:

- Amazon keeps at least 3 copies of your data (which is what you need for high reliability) in at least 2 different geographical locations. That's what we'd do ourselves, too, if we continued to use our own storage internally. So your math is off both on the storage costs and then the costs of maintaining two or more datacenters and the networks between them.

- When Amazon reduces their prices, you instantly get all your storage cheaper. This isn't something you get with your capital expenditure of disks - your costs are always fixed. This has upsides and downsides, but you certainly don't get instant price breaks to your OpEx costs. When they added cheaper, tiered storage, our bill with Amazon dropped hugely.

- There's built-in price pressure with Amazon, too. The cost of one month's rent is roughly the same as the cost of leaving. So if it gets too expensive (or unreliable or slow or whatever your metrics are), you can easily leave. And Amazon has incentive to keep lowering prices and improving speed & reliability to ensure you don't leave.

- CapEx sucks. It's hard on your cashflow, it's hard on your debt position if you need to lease or finance (we don't, but that just means it's even harder on our cashflow), it's hard on taxes (amortization sucks), etc etc. I vastly prefer reasonable OpEx costs, with no debt load, which is what Amazon gets us.

- Free data transfer in/out of EC2 can be a big win, too. It is for us, anyway.

- Our biggest win is simply that it's easy. We have a simpler architecture, a lot less people, and a lot less worry. We get to focus on our product (sharing photos) rather than the necessary evils of doing so (managing storage). We have two ops guys for a Top 500 website with over a petabyte of storage. That's pretty awesome.

Hope that helps!

Just curious, do you keep your own backups as well? Do you have any contingency plans for if Amazon's services go down?

I'm not trying to be cynical, but I'd hate to be in the position where Amazon has your business by the balls if something goes wrong. How do you guys deal with this?

amazon did go down once. smugmug has a blog post about it and what they did. go google it. the entire blog is amazing and informative and i highly recommend it.

i'd link you, but i'm lazy

Great reply. We also use S3 heavily, and one nice advantage (although not free) was the recent launch of their CDN, CloudFront (http://aws.amazon.com/cloudfront/), which helped latency on requests for assets from S3. It was nearly turnkey to enable & benefit from.

Great rationale, and it has served you well. I am a happy smugmug user.

I have a question about data security. If I wanted the data to be safe (in terms of intellectual property and corporate secrets), what do folks here think about using S3 and EC2? I really like the flexibility of using Amazon instead of buying servers for internal corporate data processing... but is secrecy a good enough reason to keep everything inside one's own firewall? (I suppose one could encrypt a whole TB of storage)

IMO it depends on what you're securing. If you have a line of business that is particularly susceptible to espionage and the expected loss would be fatal to the business you obviously shouldn't trust anyone you don't have to. However, be realistic about it: many companies don't have anything of substantial enough value to merit infiltrating Amazon to break in to your server to analyze, and while it's of course possible to statically analyze a filesystem the effort bar is higher than the "I'll just snoop in this user's mailbox" level.

On the other hand, if you're doing anything that is either illegal now or could become unpopular with a government that Amazon could be pressured by, you should consider your data confiscated now. I'm not saying I've heard of Amazon doing it, but there are already enough service provider wiretap and warrant laws out there that you'd never know what hit you if it happened.

I absolutely love smugmug! You guys rock, and it's the architecture, focus on me, and killer features that will keep me renewing every year.

I also love the way you guys built your business. True pioneers.

You are paying a premium for scaling, bandwidth, operations and lower capital cost.

Being able to smoothly scale from 1TB to 2TB (or down to 500GB) is nothing to sneeze at. Nor is having a metered 250mbps connection (shop around -- hard to get less than $30/mbps at low volume), or having someone else handle the pager 24x7x365, or paying $150 at the end of the month instead of $5,000 up front.

There are systems and scales for which S3 is actually too expensive, and of course Amazon is making a profit off of all of this.

But there are a lot of hidden costs to DIY. Ask anyone who's tried to get 6 more servers flown in on the weekend, or had to cut short a holiday to drive to the damned colo at 3am, or overbought capacity, etc.

(edit) As for the "unlimited" option -- SM et al know to the byte what their average user uses so they price the unlimited option to make a profit on the average case.

Hypothetical case study:

You want to host a liveblog for the Apple keynote at Macworld. No matter if you are a small site or Engadget, your traffic for that 90 minutes would be 2x to 100x bigger than your daily average.

So do you:

a) buy or rent extra servers for the whole month, spend a few man-weeks (or pay somebody for) setting them up, working on their synchronization, etc.

b) write a small script that regularly uploads the static liveblog HTML (or JSON) to an S3 subdomain and rely on Amazon's thousands of servers and flexible scaling (Dynamo, which powers S3, will allocate as many servers as needed to handle your load) to do the work?

Granted, option b)'s per-GB cost would be a bit higher, but your fixed cost for labor and hardware in a) would be even bigger.
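For what option b) might look like, here is a minimal sketch, not anyone's actual setup. It assumes an uploader callable that accepts the keyword arguments `Bucket`, `Key`, `Body`, `ContentType` and `CacheControl` (the `put_object` method of a boto3 S3 client takes these). The bucket name, key, cache lifetime and helper names are all hypothetical.

```python
# Sketch of option (b): push a static liveblog page to object storage on a timer.
# Assumption: `put_object` is any callable accepting the keyword arguments
# Bucket, Key, Body, ContentType and CacheControl, e.g. the put_object method
# of a boto3 S3 client. Bucket and key names here are hypothetical.
import time
from typing import Callable


def publish_once(put_object: Callable[..., object], bucket: str, key: str, html: str) -> None:
    """Upload one snapshot of the page with a short cache lifetime."""
    put_object(
        Bucket=bucket,
        Key=key,
        Body=html.encode("utf-8"),
        ContentType="text/html; charset=utf-8",
        CacheControl="max-age=15",  # short lifetime so readers see updates soon
    )


def publish_loop(put_object: Callable[..., object], bucket: str, key: str,
                 render: Callable[[], str], interval_seconds: float,
                 iterations: int) -> None:
    """Re-render and re-upload the page `iterations` times."""
    for i in range(iterations):
        publish_once(put_object, bucket, key, render())
        if i + 1 < iterations:
            time.sleep(interval_seconds)


if __name__ == "__main__":
    # Stand-in uploader so the sketch runs without credentials.
    def fake_put_object(**kwargs):
        print("would upload", kwargs["Key"], len(kwargs["Body"]), "bytes")

    publish_loop(fake_put_object, "liveblog.example.com", "index.html",
                 lambda: "<html><body>Keynote updates...</body></html>",
                 interval_seconds=0.1, iterations=3)
```

Passing the uploader in as a parameter keeps the script testable without network access; in real use you would hand it a configured client's `put_object` and run the loop for the length of the event.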

Don't forget time to spec servers, install the OS/Apps/Backup system, configure, test, drive to the datacenter, install the kit, document it. Also you will need to spend time setting up an account with a rackspace provider and arranging all the DNS, public IPs, any necessary firewalls, etc.

Also, server warranties, UPSs and the sheer hassle of specifying, ordering and configuring servers, taking them somewhere, fiddling with them, regularly patching them and so on.

And increasing the storage capacity on each machine can be done at the price of $0.1/GB

No it can't, you need to pay a competent admin to go to the servers, shut them down, install new drives, and start them up again and expand the RAID onto them. 1hr minimum. Assuming the RAID is expandable and there is physical space in the server - if you need to add more mounted disks the app must be adjusted to support that. If there is no space, you may need to replace some drives and handle moving the data onto them (several hours?), or worse buy a new extra server.

Assuming your backup system can just take another TB of data without any changes.

And what if something does go wrong? You're talking of at least half a day from the time you find out until you get someone to go, wait for the travel time, diagnose and repair and restart, then dealing with problem reports and complaints.

I hope Amazon has better monitoring on S3 than anything I have set up as well.

The problem is that storage costs are a step function. Once you cross a certain threshold (the threshold depends on your performance requirements a.k.a. IOPS), storage gets Really Freaking Expensive (tm). The steps start to get really, really steep as your capacity increases, and that's just for the primary copy of your data.

Once you get into truly large data amounts, other things start to break (RAID 5, RAID 6, tape backup, disk backup, synchronization, the ability to replace storage systems without massive outages). The good news is that they're almost all solved problems, but you're usually stuck with buying overpriced crap from EMC, Hitachi, NetApp, 3PAR and IBM (storage is a protection racket). All of this combines to explain why a good storage admin pulls down 6 figures a year.

I may be a bit myopic, but I see a world coming where technology startups trade capital costs for operating costs. S3 is pricey if you're dealing with small quantities of data, but once the step increase in your per-GB storage costs goes over 30%, you might want to reconsider. The steps only get bigger.

Something about your argument is not making sense. You say that as storage needs increase, the costs go up dramatically (that there are significant DISeconomies of scale).

Yet Amazon is acting as an aggregator of all those ever-increasing storage needs, operating at even higher scale, yet is able to turn a profit providing a service at scale. Amazon almost certainly isn't building their own hardware.

It seems you're arguing in circles somewhere there...

Sorry, I forgot to mention one of my assumptions...

There are tremendous diseconomies of scale if: you're buying the storage through traditional storage vendors. When your storage needs get really big, you'll realize that stuff is 1) a waste of money, and 2) not meeting your needs (you shape your needs to fit the available products, not vice versa). At that point, it makes enough sense to roll your own storage solution (write your own S3), tailored to your very specific needs.

James Hamilton posted a fairly robust model of how to think about this (and evaluate S3), just prior to joining Amazon.


His is a more datacenter-level view (considering $/GB, from the perspective of arbitrary GB), but builds a convincing case for why S3 @ $0.18/GB is fair.

The reason is that you don't have to deal with it. Amazon isn't magic and they use the same hardware available to everyone else. Sure, there's some scale involved when it comes to labor and power and bandwidth, but they can't undercut what you can do yourself.

However, spraying files everywhere is a pain! MogileFS makes it a lot better, but you're still in charge of monitoring it and making sure it's healthy. With only two boxes, you have to be always on call so that you can order another box from your provider fast.

Plus, there's the issue of multiple data centers. S3 doesn't just make redundant copies of your data. It makes copies across data centers. So, you're paying $0.10/GB for data in, but you don't have to pay for when it replicates copies into several data centers.

You also have to realize that you have to pay for excess capacity anytime you're doing your own storage system. If you like to keep a 50% buffer (a reasonable size), you're going to be paying 1.5x the base cost of $0.10/GB that you've come up with.
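To put numbers on the buffer point, here is a rough sketch. It assumes cost scales linearly with raw capacity, so headroom and extra copies simply multiply the base figure; the function name is made up, and the three-copy case borrows the "at least 3 copies" figure mentioned elsewhere in this thread. Note that the $0.10/GB here is a hardware price, not a monthly fee, so it is not directly comparable to S3's monthly rate.

```python
# Effective cost per GB of data actually stored when spare capacity is kept.
# Assumption: cost scales linearly with raw capacity, so headroom and the
# number of copies simply multiply the base figure.
from fractions import Fraction


def effective_cost_per_gb(raw_cost_per_gb: Fraction, headroom: Fraction,
                          copies: int = 1) -> Fraction:
    """Return raw cost x (1 + headroom) x number of copies."""
    return raw_cost_per_gb * (1 + headroom) * copies


if __name__ == "__main__":
    base = Fraction(10, 100)  # $0.10/GB from the original post
    buffered = effective_cost_per_gb(base, Fraction(1, 2))
    print(f"with 50% buffer: ${float(buffered):.2f}/GB")
    three_copies = effective_cost_per_gb(base, Fraction(1, 2), copies=3)
    print(f"with 50% buffer and 3 copies: ${float(three_copies):.2f}/GB")
```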

And then there's the issue of having to make sure you're monitoring it and that if you see a spike in storage usage you can add drives fast enough. . .

You pay for a bit of convenience with S3. I'm not going to argue that it's cheaper, but it's definitely a lot less headache. Are you going to colo several boxes in different data centers, constantly monitor the storage, make sure that they serve the files properly, make sure that more copies get replicated if one server dies, replace drives as they fail, add more servers as needed. . . ?

If you're on a large scale, I'd say you should do your own storage because you can justify making that someone's job (or a large enough portion of their job). I'm not sure I agree with SmugMug using S3, but I'm not sure I disagree either - it allows them to concentrate on what they want to do. Remember, for every tech person on HN, there's 100 that will say they're doing backups and aren't (ok, maybe not true, but you have to find an employee to manage your storage who you trust as much as Amazon).

However, most people don't have that much to store. If you're storing 100GB of data, you'd then be paying for multiple servers all with RAID and managing MogileFS or the like for what? 20% savings? $150/year? I'm as cheap as the next person, but I also like sleep. I don't want a pager calling me telling me that one of my two file stores is down and that I need to provision and configure a new box at 2am. And do you want to focus your time on creating a compelling product that your customers think is awesome or do you want to spend your time creating an awesome file store that works really well? Life has tradeoffs. You're not wrong, but I don't see Amazon as ripping people off with their pricing and I don't mind someone profiting from giving me a hassle-free, no-lock-in solution.

EDIT: I personally think your estimate of buying boxes and colo'ing them is a tad low so my 20% might be your 50% and so it might make sense by your numbers more. Maybe I've just seen crappy colo offers. Link if you know good ones! I love being proved wrong.

+1 for "I'm as cheap as the next person, but I also like sleep."

Pretty much sums it up.

Thank you for such a detailed and reasoned answer. I'm learning a hell of a lot from this thread already.

Well, I think what it really comes down to is what your business is. If your business is storage, get good at doing your own storage. If your business is web applications, get good at web applications. Then there are companies on the cusp. Flickr is clearly storage heavy enough that they need to be a storage company. 37signals, on the other hand, stores attachments and some pictures, but their primary function isn't storage - it's the interaction with that stored content.

Is storage such a thing for your business that you're willing to put a lot of labor behind its solution? Or is storage important to your business, but as long as you get something reliable it doesn't have to be the most efficient possible because it's a small part of your business relative to the other things you do (like HTML, CSS, Ruby/Python/Perl/et al., MySQL/PostgreSQL)? Your time might be better spent on other company work than on storage.

Whenever you outsource anything to an "as a service" company, there are a couple of sweet spots where it makes sense, and a (generally) larger area where it does not make sense to pay someone else to do the service for you.

Amazon makes sense for companies that are too small to dedicate an admin to a server and/or to load that server and bandwidth to 60%+ capacity. At some point, when you have a need for a small server farm and the uptime of those servers is core to your business you will likely find it is more cost effective to do the operations in-house.

Think about companies like Salesforce.com: for their datacenter needs, I don't think you could ever justify Amazon (or a similar) service.

As a rule of thumb, I charge customers in my colo facility $60/U/mo as a starting price. You get a shared pipe and relatively unmetered/unlimited bandwidth (for "business" use, no torrent hosting or stupid shit). For a little more you can get dedicated bandwidth, 95% billing, etc. I've got someone who is looking at bringing in a 3U backup server, they'll pay $200/mo to have 3mb dedicated bandwidth and the colo space. That server will have a few TB of data replicated from their main site, I don't think S3 would make sense for them.

Regarding those services that offer "unlimited storage", I think they are really banking on the fact that many people _won't_ use enough space to cause the company a loss. Sure, there are heavy users, but when it all averages out, the company (hopefully) makes money.

Also, you have to read the terms of some services closely. IIRC, some of the online backup services are limited to "personal use only". They have separate pricing tiers for business backup (because they know those customers will use more).

That's called "over-selling" - it's a pretty common model among resource-intensive businesses, which reminded me of an entry on the Dreamhost blog where Josh Jones makes a connection between the recent finance market debacle and over-selling: http://blog.dreamhost.com/2008/10/22/how-to-make-money/

Edit: the blog post is meant for light reading

With AWS i don't have to buy, set up, maintain or upgrade the hardware... or manage employees who do. Nor do i own the hardware. Nor do i have to travel to and from a datacenter.

For that service, the cost is very reasonable.

This goes for any hosting service and not just AWS as long as we are not talking about VPSes though.

I just did the math on what using S3 instead of my current business class shared hosting plan would cost me for one of my audio streaming sites.

I have around 10-15GBs of data which gets rotated regularly, so nothing is kept around for longer than around two months. With the amount of traffic I get, around 70GBs per month, and the number of hits I get, I still pay less for my business class shared hosting plan than I would pay for S3.

And I also have full shell access and scriptability.

To top it off my plan actually covers 1TB traffic a month, so my site could scale up to 12 times traffic-wise before I would need to upgrade my account. At that cross-over point Amazon S3 would still be 54% more expensive than my current solution.

This might not apply to everyone, but in my case getting S3 would be plain dumb.

just be aware because you're on a shared host, they may kick you off to a more expensive machine if you start using all that bandwidth.

What hosting service are you using, pray tell?

SmugMug's cheapest plan is $40/year, which is $3.33 a month

S3's most expensive tier is $0.15gb/mo

So, not counting bandwidth, a user would have to store roughly 22GB of images to eat up that cost.

Being conservative, a high-res JPEG out of camera (12mp) is probably 3mb on average. So, we're talking about over 7,000 images.

Even if they lost a little money on every person who maxes out their account they'd make plenty of excess on the vast majority of people who barely use their account. The same way ISPs rely on oversubscribing their connections.
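The break-even estimate above can be reproduced with a few lines. This is only a sketch under the same assumptions as the comment: storage-only cost at S3's top rate of $0.15/GB/month, 1 GB = 1,000 MB, and 3 MB per JPEG. The names are made up for illustration.

```python
# Break-even storage for a flat-fee "unlimited" plan hosted on S3.
# Assumptions: storage-only cost at $0.15/GB/month, 1 GB = 1,000 MB,
# 3 MB per JPEG as estimated in the comment above.
PLAN_DOLLARS_PER_YEAR = 40
S3_DOLLARS_PER_GB_MONTH = 0.15
MB_PER_IMAGE = 3


def break_even_gb(plan_dollars_per_year: float, dollars_per_gb_month: float) -> float:
    """GB a user can store before monthly storage cost exceeds monthly revenue."""
    return (plan_dollars_per_year / 12) / dollars_per_gb_month


if __name__ == "__main__":
    gb = break_even_gb(PLAN_DOLLARS_PER_YEAR, S3_DOLLARS_PER_GB_MONTH)
    images = gb * 1000 / MB_PER_IMAGE
    print(f"break-even: {gb:.1f} GB, about {images:,.0f} images")
```

Bandwidth and request charges would pull the break-even point lower, and cheaper storage tiers would push it higher.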

FYI, most people using high-res DSLRs are storing data in .raw format, instead of a lossy (jpeg) format.

My 12.8MP Canon EOS5D produces .RAW files of around 13MB each. I don't store in .jpg, but generally post-process a jpg thumbnail and mid-size image to make browsing my photo catalog easier (or for emailing samples).

Edits to pics are always saved as a sub-rev, so it's not uncommon for all the files associated with an image to add up to 20-40MB. This is fairly standard for anyone who shoots for $$ (full time, or for a hobby as I do).

Anyway, kind of off topic, but your 3mb number is too low :)

I'm fully aware of the raw file aspect (I shoot professionally with various gear, including a 5D). I didn't consider it in my calculations for two reasons:

- SmugMug doesn't include RAW file storage as part of their accounts. You have to pay extra for it: http://www.smugmug.com/help/smugvault

- Most pro photogs are using services like SM after first editing their raw files locally (in Aperture, Lightroom, etc.) and then exporting high-res JPEGs.

I have a 5d and use flickr, where I typically upload jpegs at full res and the highest quality settings. It's not uncommon for me to run into flickr's 10 meg limit. I'd say the average jpeg size that I upload there is more like 7 mb.

I personally haven't seen a lot of 7+mb JPEGs from a 5D but I'm sure it's possible. When I output images from a wedding at level 10, they generally range from 2 to 5mb. My editorial work usually gets exported as tiffs, so those are huge.

I'd consider saving your JPEGs at level 10 in Photoshop (max is 12). It's unlikely you'd ever need or notice the difference. In fact, my print lab specifically requests this.

I do always use 12, in fact I generally crop first before reducing the quality. I probably don't notice the difference, but since flickr rescales to make several copies I always felt that uploading at the highest quality was a good idea.

My 5D Mk II just shipped. Right around the time I posted this comment. I feel serendipitous.

Besides all the other points in favour of S3, I'd like to add one small note: in any business, you want to shift your fixed costs over to variable costs.

I use S3 for small things... it can't be beat if you only need to back up a few gigs. I agree, though, that once you scale up it gets pretty expensive. I host my own backups for the heavy stuff. (there are other providers willing to host your backups for cheaper as well, if you are willing to buy in terabyte-size chunks)

That Amazon S3 is cheaper than traditional dedicated servers is a myth that's been spreading for a while now.

You can argue all you want about how reliable and convenient it is. But it's not cheaper. In fact, it can be very expensive especially as you consume more bandwidth and storage.

It's not so much arguing that it is cheaper, but that there is still rationale for choosing it even though it's more expensive.

Of course it gets more expensive the more you use it - but so does doing it yourself. And not on a nice smooth scale, either.

There are certainly many cases for using S3 even though it is more expensive. There is nothing wrong with that.

It's just that I have met way too many people who, after I describe my situation, say as one of the first things "have you tried S3? shouldn't it be cheaper?"

That depends on how steady your demand is. If once a year you spike to 1000x your base load, S3 would be cheaper for you.

If you're going to keep 3 copies of 5+ TB over 2 data centers then S3's cost for storage is fairly reasonable but a little pricey. But if you're quickly growing you need to compare the month to month cost of data storage vs the cost of new hardware.

You forget management costs -- someone has to deal with this storage, service the computers, replace the drives, add them, add additional computers, etc.

Indeed. If your business relies on keeping 1TB of data alive, you're going to have dedicated staff keeping an eye on it.

So now, you're looking at ~$2k hardware + $150k salary for your dedicated box watcher.

Compare that to $3k for S3 where it's their problem to keep your stuff backed up and deal with hardware problems.

It's a no-brainer if you ask me.

I haven't looked back on the smugmug case since that came out, but just wanted to note that 600TB would cost $0.12/GB/mo not $0.15.

It's a stepped rate. You pay .15 for the first 600, then .12 for all additional data beyond that.

Ah, I forgot about the stepped thing, thanks. It's 0.15 only for the first 50TB though:

$0.150 per GB – first 50 TB / month of storage used

$0.140 per GB – next 50 TB / month of storage used

$0.130 per GB – next 400 TB /month of storage used

$0.120 per GB – storage used / month over 500 TB
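Applying those tiers to the 600TB figure is a quick calculation. The sketch below assumes 1 TB = 1,000 GB and storage charges only; the function name and tier table layout are made up for illustration, with the rates taken from the list above.

```python
# Tiered S3 storage pricing from the list above, in integer cents to avoid
# floating-point drift. Assumption: 1 TB = 1,000 GB.
GB_PER_TB = 1000

# (tier size in TB, cents per GB per month); None means "everything beyond".
TIERS = [(50, 15), (50, 14), (400, 13), (None, 12)]


def monthly_storage_cents(total_tb: int) -> int:
    """Charge each slice of storage at its own tier's rate."""
    remaining = total_tb
    cents = 0
    for size_tb, rate in TIERS:
        if remaining <= 0:
            break
        used = remaining if size_tb is None else min(remaining, size_tb)
        cents += used * GB_PER_TB * rate
        remaining -= used
    return cents


if __name__ == "__main__":
    tb = 600
    cents = monthly_storage_cents(tb)
    blended = cents / (tb * GB_PER_TB) / 100
    print(f"{tb} TB: ${cents / 100:,.0f}/month, blended ${blended:.4f}/GB")
```

Under these assumptions 600TB comes to $78,500/month, a blended rate of roughly $0.131/GB, so neither a flat $0.15 nor a flat $0.12 is quite right.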
