Hacker News
I ended up paying $150 for a single 60GB download from Amazon Glacier (medium.com/karppinen)
639 points by markonen on Jan 17, 2016 | 222 comments

I'm the first one to admit that Glacier pricing is neither clear nor competitive when it comes to retrieval fees. I do think that a lot of people use it the wrong way: as a cheap backup. I use:

1. My Time Machine backup (primary backup)

2. BackBlaze (secondary, offsite backup)

3. Amazon Glacier (tertiary, Amazon Ireland region)

I only store stuff on Glacier that I can't afford to lose: photos, family videos and some important documents. Glacier isn't my backup, it's the backup of my backup of my backup: it's my end-of-the-world-scenario backup. Only when my physical hard drive fails AND my Backblaze account is compromised for some reason will I need to retrieve files from Glacier. I chose the Ireland region so my most important files aren't even on the same physical continent.

When things get so dire that I need to retrieve stuff from Glacier, I'd be happy to pony up 150 dollars. Until then, the 90 cents a month fee is just cheap insurance.

I use Arq to back up my files (on OS X) to Amazon Glacier. Arq can enforce a budget, which I've set to 5€/month.

Also the dialog for restoring files is very clear in showing costs vs speed https://www.arqbackup.com/documentation/pages/restoring_from...

Haven't had any problems; I'd recommend it.

I have a similar tiered setup, but didn't add Glacier or Nearline yet.

1. Synchronization across multiple machines using Bittorrent Sync. 30-day archive on local machines and one remote (encrypted-only). One local machine has the archive set to non-deleting.

2. Time machine backups of my primary machines.

3. Encrypted backups through Arq to OneDrive.

I'll probably soon add encrypted Arq backups to B2 or Nearline. OneDrive is practically unbeatable price-wise. I work at a university, so have the academic discount. That's 65.97 Euro for four years, or ~1.37 per month for 1TB of storage space (though I only use a fraction of that).

Similarly, I have a NAS at my house pull stuff from all the other computers and back it up to its drives (snapshotted with ZFS to avoid deleting files and realizing it days later), and borg to upload to rsync.net. I paid $54 per year for 150 GB, which is pretty cheap.

Glad to hear you are happy :)

For those that don't know, rsync.net now has full, native[1] support for both attic and borg.

Our longstanding "HN Readers" discount makes it very affordable. Just email us.

[1] Native, as opposed to pointing attic to a local sshfs mount point that terminates at rsync.net, which is what we used to offer to attic/borg users. We don't have a python interpreter in our environment, so it was not easy to provide those tools. We solved the problem by (cx)freezing the attic and borg python tools into binary executables. So, still no python in our environment (reducing attack surface) but the ability to run attic and borg just like they are meant to be run.

fyi, there's also Amazon Cloud Drive which includes unlimited storage for $59.99/year.

That's a great price, but I don't think it works with borg, which is just fantastic. Hands down, the best of the programs I tried (it's a fork of attic: http://www.stavros.io/posts/holy-grail-backups/)

My backup is a RAID 1 NAS which I backup to two external drives once a year. One goes in the shed and the other to my Dad's house. Disappointingly amateurish... but cheap and more secure than any online backup.

I like your commitment to backups. Keep it up!

It's not the worst idea but at the same time it feels almost like data extortion. Like Crassus showing up outside a burning building, "Hey you can get your data back for way above market rate!"

It would help if there was a more clear pricing structure.

Isn't a pricing structure like that kind of the inverse of insurance premiums?

You can either pre-amortize over the storage period, or backload the cost at retrieval time. Charging at retrieval time also sends a clear message about what type of data this service is meant for, and shapes user behavior accordingly.

What about Amazon Cloud Drive? It's like $60/year for unlimited and seems to work just like Dropbox/OneDrive.

I think the catch with any unlimited drive solution (other than not being truly unlimited in the fine print) is they typically require you to keep the files on your computer.


With the Time Machine backup and his Backblaze gone as well?

There is no perfect backup solution, and I'd be surprised if this one failed in my lifetime.

If all of these fail at the same time I guess we'll have other things to worry about than some photos and documents...

Sorry about that. I just deleted my comment, it made no sense.

Should go get a coffee.

Or go to sleep? (or maybe coffee, timezones are weird.)

9:40am here. Definitely coffee time.

Glacier pricing has to be the most convoluted AWS pricing structure and can really screw you.

Google Nearline is a much better option IMO. Seconds of retrieval time and still the same low price, and much easier to calculate your costs when looking into large downloads.


There's also Backblaze B2 (public beta): https://www.backblaze.com/b2/cloud-storage.html

Their pricing is great (0.5¢/GB/month) but I'm a little worried about their single DC.

3 days ago they sent a newsletter about their new (alpha quality) b2sync tool which essentially is a “rsync to backblaze” utility.

This makes their offer very interesting.

Can this tool offer deduplication (so that changing a folder name does not re-upload thousands of files) or is that something I would have to code into my own backup solution?

No, it does even less than rsync: it can only upload a folder recursively, skipping files whose modification times match between local and remote.

Considering though that before that you couldn't even upload a directory, only files, this is a huge step. :)
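The behaviour described above (walk a folder, skip files whose mtimes match the remote copy) is simple enough to sketch. Here's a rough Python outline — not the actual b2sync code; `upload` and `remote_index` are hypothetical stand-ins for the B2 API calls:

```python
import os

def sync_folder(local_root, remote_index, upload):
    """Upload a folder recursively, skipping files whose remote
    copy has the same modification time (what b2sync reportedly
    does). remote_index maps relative paths to mtimes; upload is
    a caller-supplied function that sends one file."""
    uploaded = []
    for dirpath, _dirs, files in os.walk(local_root):
        for name in files:
            path = os.path.join(dirpath, name)
            rel = os.path.relpath(path, local_root)
            if remote_index.get(rel) == int(os.path.getmtime(path)):
                continue  # unchanged, skip it
            upload(path, rel)
            uploaded.append(rel)
    return uploaded
```

Note that, as the parent says, a renamed folder changes every relative path, so everything under it re-uploads: there's no content-based deduplication here.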

I currently need some help whipping up a solution for this.

git with lfs

I hear people say Backblaze is a single DC frequently, but I've never been able to find anything on their site confirming or denying.

Is there any source for the single DC?

EDIT: Nevermind. I did some more googling & found one:


They may also take a while to have the same level of software support, since their API isn't compatible with the S3 API. I know of at least one online backup software provider that works with multiple storage backends but which when asked did not intend to support B2 (since Backblaze is technically a competitor).

I think the best option for backups at AWS these days is the "Infrequent Access" storage class introduced a few months ago (probably as a reaction to Nearline): https://aws.amazon.com/blogs/aws/aws-storage-update-new-lowe...

It's almost as cheap as Glacier, but requires no waiting and has no complicated hidden costs, just simply somewhat higher request pricing, a minimum 30 days of storage and an extra $0.01 per GB for data retrievals.

Sadly, there's still the "small file gotcha":

> Standard - IA has a minimum object size of 128KB. Smaller objects will be charged for 128KB of storage.

Rounding up all small files to 128 KB can be a huge deal. I for example use Nearline to directly "rsync" my NAS box for offsite backup (yeah I know, I should use something that turns it into a real archive or something, but I'm lazy and Synology has this built in). If those hundreds of thousands of (often) small files were rounded up, S3-IA would easily be more expensive than S3/GCS.
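A back-of-the-envelope sketch of that effect (the prices here are illustrative approximations of the 2016 rates, ~$0.0125/GB for S3-IA and ~$0.01/GB for Nearline):

```python
def monthly_storage_cost(n_files, avg_kb, price_per_gb, min_billable_kb=0):
    """Monthly storage cost when each object is billed at
    max(actual size, minimum billable size). Sizes in KB."""
    billed_kb = max(avg_kb, min_billable_kb)
    return n_files * billed_kb / (1024 * 1024) * price_per_gb

# 500,000 files averaging 8 KB each:
ia = monthly_storage_cost(500_000, 8, 0.0125, min_billable_kb=128)
nearline = monthly_storage_cost(500_000, 8, 0.01)
# The 128 KB floor makes S3-IA roughly 20x more expensive here,
# despite the nearly identical per-GB price.
```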

Disclaimer: I work at Google on Compute Engine and am a happy Nearline customer myself.

You can also try the duplicity backup tool. It uses tar files to store the primary backup and its binary incremental deltas. It supports a number of backup back-ends, but you can also just generate the backup to a local disk and then rsync it to a remote location.

imo only crazy people upload anything to the cloud without compression+encryption.

You could solve that by using tar/zip to bundle your backups, but obviously that's an extra step.

They're both difficult to grok but for different reasons. (Well, the same underlying reason - bandwidth - just accounted for differently).

Glacier, as the article points out, is difficult to grok because there are provisioned bandwidth costs.

Nearline is difficult to grok because it simply caps your bandwidth at 4 MB/s per TB stored. It starts at 1 MB/s.

Basically, both are optimized to make retrieval slow and/or expensive because the cost optimizations they're doing internally are incentivizing very cold data, very big data, or both.

Note that with "on-demand I/O" you can basically remove this I/O throttle (best effort though) at an increased cost. Nearline is still cheaper than dealing with glacier (or even S3 Infrequent Access) particularly for small files.
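To put the throttle in perspective, a quick sketch using the figures above (4 MB/s per TB stored, floor of 1 MB/s; ignoring on-demand I/O):

```python
def nearline_full_restore_hours(stored_tb):
    """Hours to read back an entire Nearline bucket at the
    default throttle of 4 MB/s per TB stored (minimum 1 MB/s)."""
    rate_mb_per_s = max(4.0 * stored_tb, 1.0)
    total_mb = stored_tb * 1024 * 1024
    return total_mb / rate_mb_per_s / 3600

# Above the 1 MB/s floor the ratio is constant: whether you store
# 1 TB or 100 TB, a full restore takes about 73 hours (~3 days).
```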

Disclaimer: I work on Compute Engine (GCE) but not Cloud Storage (GCS).

>Glacier, as the article points out, is difficult to grok because there are provisioned bandwidth costs.

It does go against their slogan "only pay for what you use", though. You're not using that 15GB/hour for 744 hours, and haven't explicitly selected provisioned IOPs.

> It starts at 1 MB/s.

Not a problem if all you can get is a crappy 8 Mbps ADSL. Oh well, let me whine... ;)

I'd point out that we're generally trying to offer simple, competitive pricing across our services (we're not totally there yet).

This write up is a pretty good summary of our differences on the compute side: https://news.ycombinator.com/item?id=10922105

Disclaimer: I work on Compute Engine, and specific to that article launched Preemptible VMs.

My main problem with the Google Cloud Storage as an individual customer in Germany is this: https://support.google.com/cloud/answer/6090602?hl=en

Since I can't find a way to set my account status to "personal" I'm somehow personally responsible to pay VAT (I wouldn't even know how to do that) if I don't want to risk getting problems with our version of the IRS. That doesn't sound like a nice way to do my personal backups.

On AWS that's not a problem, Amazon bills and pays the VAT for me.


That's all any of us on Google Cloud can usefully say about the damn VAT issue. We've got people working on it, but it's a Google wide issue (and I honestly don't understand the last mile enough to know why we don't do exactly what you describe AWS doing).

As a side reply to the poster below, it's not true that we're trying to exclude individuals. There are lots of happy, individual customers all over the world. We unfortunately just didn't get out ahead of this personal service / VAT thing as a company. So again, Sorry! And someone is working on it, but don't expect any immediate fixes when taxes/regulations/laws are involved.

As nonsensical as it is, you're not meant to be using Google Cloud Platform as an individual user at all.

From that same page: "Google Cloud Platform services can be used only for business purposes in the European Union."

Seems like AWS has capitalised on the gap to target the hobbyist AND enterprise markets, whereas google doesn't really want that.

So pass this up the chain if you can: I've used both Azure and (continue to use) AWS, but I've never seriously even considered Google Cloud Platform because I feel like Google's service offering is not stable. I know inherently that this is because of the way Google treats its consumer offerings, but I can't help but imagine that it applies to the business offerings as well.

I hear you. Those of us that care regret the shutdown of Reader, as it has almost certainly cost us more business dollars than running it in perpetuity would have.

That said, Google (like most businesses?) hasn't cancelled any product that is making lots of money. At Google, that means Ads, Apps, and yes Cloud. We're here to stay, and unlike on the consumer side Cloud has an explicit deprecation policy for products that have "gone GA" (graduated from Beta).

Big picture items like Compute Engine, Google Cloud Storage, etc. aren't going anywhere. We might deprecate our v1 API after we're on v2 of course, but we're not cancelling the product (again, it makes us money, and that's a pretty key distinction from the consumer products Google has historically shuttered).

Disclaimer: I work on Compute Engine and haven't personally cancelled anything yet ;).


Entering exactly the phrase "Does Google cloud have any plain VMs (running CentOS for example) available" into Google search results in, as a first result, a link explaining exactly how to launch one of those [0].

Your comment doesn't really relate to the parent, to this thread and is a question that is trivially easy to answer. There's no point in hundreds-to-thousands of HN readers seeing your comment which can be answered in under a second and anyone who's used GCE already knows the answer to.

Please consider querying Google, or a relevant knowledge base, before querying a tangentially related HN comment chain.

[0]: https://cloud.google.com/compute/docs/operating-systems/linu...

You're absolutely right, my apologies. -- I recall looking for this quite a while ago without success, but I should have done a fresh search before posting.

Yes, Compute Engine is what you're after (https://cloud.google.com/compute/). Those predefined things are getting started / "click to deploy" for people that don't want to bother.

Disclaimer: I work on Compute Engine.

Yes, there are Debian, Ubuntu, Centos, SUSE, CoreOS, RHEL and Windows Server images available for launch. You should see them in your cloud dashboard as choices when you go to launch an instance.

OP here. Some updates and clarifications are in order!

First of all, I just woke up (it’s morning here in Helsinki) and found a nice email from Amazon letting me know that they had refunded the retrieval cost to my account. They also acknowledged the need to clarify the charges on their product pages.

This obviously makes me happy, but I would caution against taking this as a signal that Amazon will bail you out in case you mess up like I did. It continues to be up to us to fully understand the products and associated liabilities we sign up for.

I didn't request a refund because I frankly didn't think I had a case. The only angle I considered pursuing was the boto bug. Even though it didn't increase my bill, it stopped me from getting my files quickly. And getting them quickly was what I was paying the huge premium for.

That said, here are some comments on specific issues raised in this thread:

- Using Arq or S3's lifecycle policies would have made a huge difference in my retrieval experience. Unfortunately for me, those options didn't exist when I first uploaded the archives, and switching to them would have involved the same sort of retrieval process I described in the post.

- During my investigation and even my visits to the AWS console, I saw plenty of tools and options for limiting retrieval rates and costs. The problem was that since my mental model had the maximum cost at less than a dollar, I didn't pay attention. I imagined that the tools were there for people with terabytes or petabytes of archives, not for me with just 60GB.

- I continue to believe that “starting at $0.011 per gigabyte” is not an honest way of describing the data retrieval costs of Glacier, especially when the actual cost is detailed, of all things, in an answer to an FAQ question. I hammer on this point because I don't think other AWS products have this problem.

- I obviously don't think it's against the law here in Finland to migrate content off your legally bought CDs and then throw the CDs out. Selling the originals, or even giving them away to a friend, might have been a different story. But as pointed out in the thread, your mileage will vary.

- I am a very happy AWS customer, and my business will continue to spend tens of thousands a year on AWS services. That goes to something boulos said in the thread: "I think the reality is that most cloud customers are approximately consumers". You'd hope my due diligence is better on the business side of things, as a 185X mistake there would easily bankrupt the whole company. But the consumer me and the business owner me are, in the end, the same person.

Glacier's pricing structure is complicated, but fortunately it's now fairly straightforward to set up a policy to cap your data retrieval rate and limit your costs. This was only introduced a year ago, so if like Marko you started using Glacier before that it could be easy to miss, but it's probably something that anyone using Glacier should do.

http://docs.aws.amazon.com/amazonglacier/latest/dev/data-ret... https://aws.amazon.com/blogs/aws/data-retrieval-policies-aud...
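For completeness, the policy those pages describe can also be set programmatically. A rough sketch with boto3 (`SetDataRetrievalPolicy` is the documented Glacier API operation; the region and the 1 GiB/hour cap are arbitrary examples):

```python
# Reject any retrieval that would fall outside the monthly free tier.
FREE_TIER_POLICY = {"Rules": [{"Strategy": "FreeTier"}]}

# Or cap the retrieval rate explicitly (here: 1 GiB per hour).
RATE_CAP_POLICY = {
    "Rules": [{"Strategy": "BytesPerHour", "BytesPerHour": 1024 ** 3}]
}

def apply_retrieval_policy(policy, region="eu-west-1"):
    """Apply an account-wide Glacier data retrieval policy."""
    import boto3  # AWS SDK; imported lazily so the dicts above stand alone
    glacier = boto3.client("glacier", region_name=region)
    # accountId "-" means "the account that owns these credentials"
    glacier.set_data_retrieval_policy(accountId="-", Policy=policy)
```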

> fortunately it's now fairly straightforward to set up a policy to cap your data retrieval rate

Amazon has done a great job with this feature. By doing a poor job implementing something for an extremely narrow use case, in a technology that is outdated, and then providing the most complicated pricing structure surrounding every aspect of the product, one can't help but use the feature: any other provider or service.

Like, wtf would be the use case for Amazon Glacier in 2016? I don't think I would put hundreds of petabytes of SATA into 20-year cold storage, and the author of this post certainly wouldn't use it again. The fact that I need to read 2 pages of pricing docs, and then the 2 pages you linked just to control costs I can't estimate myself, is a sure sign this is absurd.

SOX compliance, legal requirements to save communications, etc. There are a lot of places where there are needs to maintain a huge amount of information that you're probably never going to need again.

Not all products are for all people. If you foresee a need to recover a large amount of data all at once, then glacier's not for you. If you might occasionally need a filing from 6 years ago, then glacier would be great.

It's not about recovering a large amount of data; it's about recovering a large percentage of your own stored data.

Amazon starts to charge you extra any time you exceed restoring 5% of your data.

If, for example, you save all your tax-related documents in Glacier and you are then audited, the accounts department or the government will want all the information. Not 5% of it. Not 10% of it. Everything. At that point Amazon will have you over a barrel, because getting the data out in a reasonable time frame will cost exponentially more than dripping out the data over the course of 20 months.

> Amazon starts to charge you extra anytime you exceed restoring 5% of the data.

Isn't one way to get past this... increasing your data usage by 20x? If OP used less than $1 a month, then if he uploaded $20 of junk data, he could get the 5% of original data back "for free". Sure, it's $20, but it beats $150+.

It looks like it is even more complicated than that. You can get 5% out per month at no charge, but only if you spread it out across the entire month. The extra charges kick in the first time you exceed 5%/30 in a single day.
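Put as arithmetic (per the pricing docs: 5% of your average monthly storage free per month, pro-rated daily):

```python
def free_retrieval_per_day_gb(stored_gb):
    """Daily free retrieval allowance: 5% of stored data per month,
    pro-rated over 30 days. Exceed this in a single day and the
    peak-rate charges kick in."""
    return stored_gb * 0.05 / 30

# With 60 GB stored, the free allowance is 0.1 GB/day, so a
# completely free restore of everything takes 600 days (~20 months).
days_for_free_restore = 60 / free_retrieval_per_day_gb(60)
```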

A better way would be if you had 20+ categories of data that are totally unrelated, like your tax stuff, your code, your diary from 1995-2010, ... Since these are very unrelated, you are not likely to need all of them at the same time, ASAP.

Though it's hard for me to imagine having so many categories of unrelated, useful and important data.

Glacier is about storing a large amount of data in a cost-efficient manner, when you do not anticipate needing it 1) cheaply and 2) quickly.

Legal issues, at least in the US, do not have those requirements.

I have had situations in the past 12 months where recovering past data would be worth >$1k for the right 100k of data.

>If for example you save all your tax-related documents in Glacier, then you are audited then the accounts department or the government will want all the information. Not 5% of it. Not 10% of it. Everything.

Are you sure about that? I haven't worked with tax litigation specifically, but I've worked with e-discovery w.r.t. e-mail and I can assure you that no one ever asks for all the e-mails sent by a particular company over all time. It's always a matter of asking for the e-mails sent by or received by a select group of people, over a fairly discrete time period. For something like this, a Glacier store might make sense, if it was coupled with an online metadata cache stored in e.g. S3.

With tax litigation the issue is that you have to prove you didn't simply shift money and accounting briefs around, and the only way to realistically prove it is to show all the statements in the period you're required to keep them for (I think that's the last 6 years).

The government basically comes to you and says they think you owe X, and you have to prove that false to their satisfaction. The more data you give your CPA to work with, the better.

There are some use cases listed here, and they seem pretty reasonable to me: https://aws.amazon.com/glacier

Lots of businesses have data retention requirements, and it can be difficult and time consuming to make sure this data is backed up in a way that is secure and can survive a catastrophe.

The author's use case (and most other personal use cases) might not be a good fit for Glacier, but he's not the target market.

Tape storage is still the optimal form of long-term storage. If you need to store things for an exceptionally long time, such as financial data, scientific data, etc., then you're going to get the most bang for your buck on tape.

The post states he paid $150 to retrieve 60GB of data. For $150 you can buy a 5TB hard drive.

In what use case could the price delta make sense, given a 4-hour feedback loop and all of your important data locked in someone else's data center?

The use case where your data is so properly massive that this makes sense && you don't have the storage infrastructure in place is so narrow that it doesn't make sense.

It is basically one research student's crawl data

Edit: also, S3 is pretty cheap. So again, I don't really see the use case here. How much room is there in the market between your own physical or digital system and Amazon S3 or an equivalent? You would have to have a massive amount of data you don't care about and be very price sensitive.

A user posted a long example below: https://news.ycombinator.com/item?id=10921709

You don't have to pay $150 for retrieval of 60GB. And you don't do long-term storage for X TB / 5 TB * $150. You might have to rent space in someone else's datacenter to put your own external backup... or you could pay Amazon for Glacier and not deal with maintenance etc. Might be worth it even if you have infrastructure for all data that isn't glacier-cold.

Durability is one reason. A physical hard drive in my desk is so much more susceptible to destruction or loss or theft.


The data I care about is already backed up on two different multi-TB drives at home, and another one at work.

Glacier is the contingency for "something took out the original data and all three backups in two different locations 7 or 8km apart - if I'm still alive after whatever just happened, I'll consider whether or not to pay Amazon a grand or so to retrieve it quickly from Glacier, or wait ~20 months to get it all for single-digit-dollars".

If you're talking about personal data, why not just use backblaze and amazon's consumer unlimited cloud storage?

That gives you 2 backup providers that can durably store everything and it's free and quick to access. Why deal with all the harddrives and glacier?

Neither of those seem free - am I looking in the wrong place? Also there's no Linux client for either AFAICT.

Right, the services cost money but retrieval is free. Going by the cost of the hard drives amortized out, it'll probably be the same or less. You get far more durability and less complexity, with universal web access.

I believe there are other similar services for Linux or you can just use browser to upload files with Amazon.

He paid $150 plus between 50 and 80 CENTS a month to hold a backup of 60GB for four plus years.

Yes but if you were in this target market you'd likely want your information on your own tape with a reading infrastructure you control.

Depends. Having an in-house key and shipping everything to Amazon encrypted means that you have all the infrastructure there and waiting, and not capable of being read. Additionally, that tape library would need to be stored, and periodically tested so that tapes can be rotated out as needed. Sending that data to a service like glacier means you've shown due diligence, but at the same time, don't need to maintain a schedule of testing every disk every year.

> don't need to maintain a schedule of testing every disk every year.

So you trust Glacier or Google Nearline (or any similar provider) without testing? No testing ever???

I wouldn't feel happy with my critical data floating around in the cloud, without my checking it at least once a year to make sure it really exists.

And once you do start verifying that data, you will incur all sorts of charges to access it.

This was what I meant, thanks. You can buy a 5TB hard drive for ~$138. You could likely buy several of them at a discount to get started. As time goes on, these will become much cheaper, allowing you to continue purchasing them on demand from the market for much less money.

This allows you to trivially share, copy, move and retrieve that data quickly as well as fully control who has access and when.

I am sure there are use cases for this but in a situation where you have petabyte scale data, it is often the case that you also have the infrastructure to save it. How many places would need to store >5tb of data a week that

* don't have this capability in house

* will almost never need to access it again.

* will not need it in a timely manner, if they do need to access it again.

* don't have the money to implement their dedicated server and storage on site for this purpose.

I am not saying that this rules everyone out, but the prices are so low, and tape must be so annoying, that I can't imagine why they keep offering this. Obviously, some people must be using it, but in 2016, with storage prices being so low already, I don't know how many places have this amount of data and meet the above requirements.

2 Words: Offsite Backup

Arq has a fantastic Glacier restore mechanism. You select a transfer rate with a slider, and it informs you how much it will cost and how long it will take to retrieve. It optimizes this with an every-four-hours sequencing as well. See https://www.arqbackup.com/documentation/pages/restoring_from... for reference.

It's unfortunate that Arq forces you to archive in their proprietary format. That locks you in to the tool.

I was also concerned about this when I was looking into Arq, so I wrote a cross-platform restoration tool that'll also work on Windows and Linux (not just Mac): https://github.com/asimihsan/arqinator

This is purely based on the author's excellent description of the format in his arq_restore tool: https://www.arqbackup.com/s3_data_format.txt

No it doesn't. They've had an open-source CLI tool since at least 2013.


I just started using Arq, but it looks like the format is documented here https://www.arqbackup.com/s3_data_format.txt and there's some open source tools that can work with the format. https://github.com/asimihsan/arqinator and https://godoc.org/code.google.com/p/rsc/arq/arqfs

The only use case I would be willing to commit to glacier would be legal-hold or similar compliance requirement.

The idea would be that the data would either never be restored or you could compel someone else to foot the bill or using cost sharing as a negotiation lever. (Oh, you want all of our email for the last 10 years? Sure, you pick up the $X retrieval and processing costs)

Few if any individuals have any business using the service. Nerds should use standard object storage or something like rsync.net. Normal people should use Backblaze/etc and be done with it.

Back when I worked in banking we had requirements like that (though we didn't use glacier)

We had a legal requirement to be able to produce up to 7 years' worth of bank statements upon receipt of a subpoena.

Not "reproduce the statements from your transactions records" but "give us a copy of the statement that you sent to this person 6.5 years ago"

We had operational data stores that could generate a new statement for that time period, but if we received the subpoena then we needed to be able to produce the original, that included the (printed) address that we sent it to, etc.

We had (online) records of "for account 12345, on 27th October 2011, we sent out a statement with id XYZ", we'd just need a way to pull up statement XYZ.

There's no way(^) we'd ever get subpoenaed for more than 5% of our total statement records in a single month, so something like Glacier would have been a great fit.

We had other imaging+workflow processes where we'd receive a fax/letter from a client requesting certain work be undertaken (e.g a change of address form). 90 days after the task was completed, you could be pretty sure that you wouldn't need to look at the imaged form again, but not 100% sure. We could have used glacier for that.

One use case that would have cost us (rare, but we needed to plan for it) was: "We just found that employee ABC was committing fraud. Pull up the original copies of all the work they did for the 3 years they worked here, and have someone check that they performed the actions as requested." Depending on circumstances & volume that might trigger some retrieval costs, but the net saving would almost certainly still be worth it.

(^) Unless there was some sort of class action against us, but that's not a scenario we optimised for.

I'm happy enough to use it as a "third copy" - for never-expected-to-be-used recovery if both my local and remote backups fail.

I know it'll take either a lot of time or money to restore from Glacier, but if my home and work backups have both gone I'll either not care about my data any more, or I'll be perfectly happy to throw a grand or so at Amazon to get my stuff back (or, more likely, be happy to wait up to 20 months for the final bits of my music and photo collections to come back to my own drives).

Glacier is not a cheap/viable backup.

It's even less suited to disaster recovery (unless you have insurance).

Think about it. For a primary backup, you need speed and ease of retrieval. Local media is best suited to that, unless you have an internet pipe big enough for your dataset (at a very minimum, 100 megabits per terabyte).

A 4/8-hour time to recovery is pretty poor even for a small company, so you'll need something quicker for your primary backup.

Then we get into the realms of disaster recovery. However, getting your data out is neither fast nor cheap: at ~$2000 per terabyte for retrieval alone, plus the inherent lack of speed, it's really not compelling.

Previous $work had two tape robots: one was 2.5 PB, the other 7(ish). They cost about $200-400k each. Yes, they were reasonably slow at random access, but once you got the tapes you wanted (about 15 minutes for all 24 drives) you could stream data in or out at 2400 megabytes a second.

Yes, there is the cost of power and cooling, but it's fairly low unless you are running at full tilt.

We had a reciprocal arrangement where we hosted another company's robot in exchange for them hosting ours. We then had DWDM fibre to get a 40-gig link between the two server rooms.

The post is a useful cautionary tale, and he's not alone in getting burned by Glacier pricing. Unfortunately, it comes down to OP not reading the docs properly.

Yes, the docs are imperfect (and were likely worse back in the day). And it was compounded by the bug, apparently. But it's what everyone on HN has learned in one way or another... RTFM.

Was it mentioned in the article that a retrieval is spread over four hours, and that you can request partial chunks of a file? Heck, you can always retrieve all your data from Glacier for free if you're willing to wait long enough.

And if it's a LOT of data, you can even pay and they'll ship it on a hardware storage device (Amazon Snowball).

Anyone can screw up; I'm sure we all have, goodness knows I have. But at the very least, pay attention to the pricing section, especially if it links to an FAQ.

I would say the FM in this case was unreadable. “starting at $0.011 per gigabyte”, "learn more" - no one would expect to pay $150 here.

It's definitely sleazy of Amazon to hide the pricing info like that, but I can't imagine seeing that sentence while making a purchase and not clicking through to see how the pricing actually works. But I guess that's just me.

Not just you. But read the page referred to by that "Learn more" link. It's very unclear. I mean, of course, it has an unambiguous meaning. But to understand it you need to read a lengthy sheet of prose, and only deep in it, after many definitions, do you notice the phrase "we multiply your peak hourly billable retrieval rate" ... "by the number of hours in a month". What concentration of attention, and how much time, does a reader need to spend to understand this important nuance?

As far as I can see, everywhere else they specify their pricing "per GB", and only this small phrase uncovers the real meaning, which is not per actual GB you transferred, but your peak rate multiplied by the number of hours in the month. IMHO this should be one of the first phrases describing the pricing model.
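To make the gotcha concrete, here's a back-of-the-envelope sketch of that formula as I understand it from the docs. The 4-hour retrieval spread and the conversion of the free tier into an hourly rate are my assumptions, and a real bill like the article's $150 also depends on exactly when requests land:

```python
def glacier_retrieval_fee(peak_hourly_gb, stored_gb, rate_per_gb=0.01,
                          hours_in_month=720):
    """Pre-2017 Glacier retrieval fee, as I read the docs: you're billed
    on your *peak* billable hourly retrieval rate, multiplied across
    every hour of the month -- not on the bytes you actually pulled."""
    free_hourly_gb = stored_gb * 0.05 / hours_in_month  # 5%/month, pro-rated
    billable = max(peak_hourly_gb - free_hourly_gb, 0)
    return billable * hours_in_month * rate_per_gb

# Requesting ~60 GB in one batch: each job is spread over ~4 hours,
# so the peak hour is ~15 GB, and the fee comes out to roughly $108
# under these assumptions -- the same ballpark as the article's bill.
fee = glacier_retrieval_fee(peak_hourly_gb=60 / 4, stored_gb=60)
```

The point the code makes plain: the bytes you download barely matter; the peak hour does.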

>Glacier is designed with the expectation that retrievals are infrequent and unusual, and data will be stored for extended periods of time. You can retrieve up to 5% of your average monthly storage (pro-rated daily) for free each month. If you choose to retrieve more than this amount of data in a month, you are charged a retrieval fee starting at $0.01 per gigabyte. Learn more.

If you want more than 5% of your data in a month, the minimum it will cost you is $0.011 per gigabyte. Click here to see how to work out how much it will cost.

It's certainly not unreadable, although to be fair, it's nowhere near as clear as it should be.

It's not unreadable, but is designed to make it difficult to realize that typical usage might result in very significant costs.

'Use Amazon Glacier, it costs about a cent!' is the essential pitch.

I don't think he really was "burned".

Paying 87c a month for a couple of years, then 52c a month for a few more years, to back up 60+GB, and then getting a one off fee of $150 still averages out at around $2/mo. Hardly getting shafted.

Depends on your budget. If you don't have $150 disposable income to blow on a screwup, it's an expensive mistake.

It's like getting a parking ticket for parking somewhere you've unknowingly been parking illegally for years. Yeah, if you average it out, it's cheap. But having to stump up the cash still hurts.

60Gig is what - maybe 200 CDs losslessly compressed? That's a couple of grand's worth of disposable income he entrusted to Amazon and his own management skills (assuming he no longer has access to the original disks to re-rip them (and ignoring the copyright implications of that)).

I guess he always had Linus's backup strategy open to him - "Only wimps use music backups: real men just upload their important tunes on Bittorrent, and let the rest of the world seed it ;)"

Meta, Response to sub-comments[meta enough yet?]: Ah, that makes way more sense. Sorry for labelling the parent a troll. This was clearly my mistaken interpretation. It was weird. As I tried to check out the math on the parent's assertions, they came out an order of magnitude too high like 3 times in a row. So I concluded that the parent was intentionally inflating the relevant costs to misdirect the discussion. However, it was really just my failure to consider the perspective of the parent, aligned with a loose enough situation that the mistake was possible. Again, I apologize for the misapplied label. I'll leave the original comment below so as not to rewrite history on how I was (due to sincere misunderstanding, I maintain) kind of a jerk :/

A couple grand? 500GB hdds have been less than $100 for a long time now, certainly since before Amazon Glacier was a product.

Just because you can formulate a large valuation for something doesn't mean that it's a reasonable valuation.

Even $10/cd is a ridiculous estimate of the cost of blank CDs. If you insisted on using blank CDs for backup, you could do 200 CDs for $40.

Edit: And what are CDs... 640mb? that's like 12 CDs, not 200... I'm realizing that I've fallen into a troll trap...

Meta: I guess I'll leave this comment up as a cautionary tale.

Just in case you are confused, the person above you is talking about the cost of the music ON the CD, not the CD itself.

Indeed. I'm guessing 200-ish albums, purchased at ~$10 each, would take up 60G when losslessly compressed. (As pointed out, 640MB/disk would mean just under 100 "full" audio CDs as bit for bit copies, so 200 is probably a quite low estimate).

I bought my ~3000 disc CD collection here in Australia, many of them still have price stickers of $30 and more on them. While CDs were "a thing" here, I figure I spent about a quarter of a cheap house on them. (I don't regret a cent though. I'm also "that guy" who thinks "You only had 60Gig of music to back up? Wow!", but is mostly socially-aware enough to keep that sort of reaction to himself.)

FWIW, no offence taken, an understandable mistake to make given my lack of clarity.

> If you insisted on using blank CDs for backup, you could do 200 CDs for $40.

Wouldn't 2 x 100GB Blu-ray discs be better for the same price? Less physical storage space required and less time to burn the data.

That's definitely a valid counterpoint, agreed.

Like you said, a good cautionary tale. Even this incident, although definitely rough, isn't that bad compared to what can happen if you scale a full infrastructure improperly, or just overlook something involving a software-based business. Just a couple of months ago, a test EC2 instance that had been forgotten about got spun up by a very archaic piece of code, and ended up costing the guys I'm working with quite a bit. Always RTFM (especially AWS pricing docs) carefully, or feel the burn.

Snowball can only import data, not export it. Here is the footnote on the Snowball page:

Snowball currently supports importing data to AWS. Exporting data out of AWS will be supported in a future release.

Even after people have RTFM, many are still misunderstanding the pricing model. For instance, people now think you can store 2TB of data and download 100GB of it at once (5%) for free. Nope! You can request a max of 0.16% per day of data for free. Once you exceed 0.16% of your data in any given day, you start getting charged. I think it's fair to say that even with the intent to RTFM, the pricing model is unusually strange and AWS should really put more effort into explaining it.
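In other words, the pro-rating works out like this trivial sketch:

```python
def daily_free_retrieval_gb(stored_gb, days_in_month=30):
    """5% of average monthly storage free per month, pro-rated daily:
    that works out to roughly 0.167% of your stored data per day."""
    return stored_gb * 0.05 / days_in_month

# Store 2 TB and the free allowance is ~3.3 GB per day,
# not 100 GB whenever you feel like it.
allowance = daily_free_retrieval_gb(2000)
```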

> The post is a useful cautionary tale, and he's not alone in getting burned by Glacier pricing. Unfortunately it was OP not reading the docs properly.

Well, it was that and also the docs being knowingly and deliberately set up to trick incautious readers. The user does have a responsibility to read the fine print, but that doesn't excuse Amazon being openly evil about it. This is no different than ISPs that advertise "up to 50Mb/s" when they know very well that their network won't deliver more than 5.

This sounds a lot like demand-billing [1] [2] that's common with electric utilities, particularly commercial, and increasingly, people with grid-tied solar installations. [citation needed]

You pay a lower per-kilowatt-hour rate, but your demand rate for the entire month is based on the highest 15-minute average in the entire month, then applied to the entire month.

You can easily double or triple your electric bill with only 15 minutes of full-power usage.
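A minimal sketch of that billing structure (the numbers are made up for illustration):

```python
def monthly_demand_charge(peak_15min_kw, demand_rate_per_kw):
    """Demand billing: the charge keys off the single highest 15-minute
    average draw in the month, regardless of total energy consumed."""
    return peak_15min_kw * demand_rate_per_kw

# One 15-minute, 10 kW spike at a hypothetical $15/kW demand rate
# adds $150 to the month's bill on its own.
extra = monthly_demand_charge(10, 15)
```

Structurally it's the same trap as Glacier's peak-hour billing: a brief spike is charged as if it were sustained.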

I once got a demand bill from the power company that indicated a load that was 3 times the capacity of my circuit (1800 amps on a 600 amp service). It took me several days to get through to a representative that understood why that was not possible.

[1] http://www.stem.com/resources/learning

[2] http://www.askoncor.com/EN/Pages/FAQs/Billing-and-Rates-8.as...

You don't have a backup until you test its restore.

I've never heard that and I'm stealing it.

Come on, don't downvote someone for learning something (you think is) well known for the first time. Especially when their response is "that's great and I'm going to use it". https://xkcd.com/1053/

The comment is noise - no different from "+1" or "me too". If you found a comment helpful and want to thank someone for it, the way to do that is an upvote.

It's a bit different, I added that I had never heard of the phrase and that I would be using it.

Even with the large data retrieval bill he still saves ~$100 vs the price of keeping that data in S3 over the same time period. Reading this honestly makes me think glacier could be great for a catastrophic failure backup.

Compare to Google Drive at 100 Gigs for2 bucks and no bandwidth charges that I'm aware of.

I looked up the Google drive pricing and the costs increase quickly over that 100GB level. I'm going to use 3TB of data for my example because that is approximately the amount of data I would be backing up at work. The cost for Google Drive would be the $99 a month 10TB plan, Amazon Glacier is $21.51 a month. This is before you get into things like having the enterprise AWS ecosystem with IAM versus a single user Gmail account. Remember I am only talking about retrieval in the case of a catastrophic failure, the data is already backed up elsewhere. As long as I can manage to go a year without destroying all my backups Glacier comes out on top over Drive even taking into account the retrieval fees. In the best case scenario I never ever retrieve that data.

If one can download a percentage for free each month - 5% in this case, and the price of storage is dirt-cheap, then couldn't one just dump empty blocks in until the amount desired for retrieval falls under the 5% limit? In this case, if one wants to retrieve 63.3 GB, uploading 1202.7 GB more for a total of 1266 GB, 63.3 GB of which represents just under 5%. There's no cost for data transfer in and the monthly cost at $0.007/GB would be just $8.87. And that's just for the one month because everything wanted would be coming out the same month.

Has anyone tried this or know of a gotcha that would exclude this?

And I realize that for the OP's situation, it wouldn't have mattered since he thought he was going to get charged a fraction of this.
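A quick sanity check of that arithmetic (ignoring the daily pro-rating wrinkle, and assuming Glacier's $0.007/GB-month storage price):

```python
def padding_gb(retrieve_gb, free_fraction=0.05):
    """Extra filler data needed so retrieve_gb falls within the free
    monthly retrieval fraction."""
    return retrieve_gb / free_fraction - retrieve_gb

pad = padding_gb(63.3)            # ~1202.7 GB of filler, 1266 GB total
monthly = (63.3 + pad) * 0.007    # ~$8.86/month while it all sits there
```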

I believe that objects cannot be transitioned into the Glacier Storage class until 60 days after their upload date. So at the very least, you will have 60 days of latency (and thus ~$20 of monthly fees) before you can extract your other data.

Additionally, I wouldn't be surprised if the 5% is also based on a storage measurement that is pro-rated for the month. So I would let the 1200 GB of data sit in Glacier Storage for another month before extracting anything, just to be (more) safe.

That's possibly a way to trick the system, but storing 63.3 GB in S3 would cost OP less than 2$ for standard availability and less than 1$ for reduced availability, not counting request costs (which are not as surprising as the hidden costs in question here). At this scale you should just store it in S3 and be done with it.

sure, I wasn't intending to suggest this as a good premeditated maneuver, but there are probably other individuals out there who have found themselves in a similar position to the OP and considering their predicament.

If you've gotten into Glacier for the wrong reason, you may already be in the trap, and you can quickly rip yourself free and lose a bunch of skin, spend almost 2 years ever so gently prying yourself free, or maybe find a third way. That's my angle here. Also, traps don't have to be laid for someone to feel like he's in one, so I'm not putting that on AWS.

The cheapest way out seems to be to just grab 5%/month over 20 months, but that's a lot of sustained effort and contact with the service. So I could see a trick like this as a potential middle ground, at three months and ~$30 according to previous comment's details.

Glacier is more comfortable to use through S3, where you upload and download files with the regular S3 console, and just set their storage class to Glacier with a lifecycle rule. I've used the instructions in here to do it: https://aws.amazon.com/blogs/aws/archive-s3-to-glacier/
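For anyone curious what that looks like in code, here's a sketch of such a lifecycle rule. This only builds the rule document — the prefix and 30-day threshold are made-up examples — and you'd apply it via the S3 console or boto3's put_bucket_lifecycle_configuration call (I'm going from the current API shape, which may differ from what existed back then):

```python
def glacier_transition_rule(prefix="archive/", after_days=30):
    """Lifecycle rule: transition objects under `prefix` to the
    Glacier storage class once they're `after_days` old."""
    return {
        "Rules": [{
            "ID": "archive-to-glacier",
            "Status": "Enabled",
            "Filter": {"Prefix": prefix},
            "Transitions": [{"Days": after_days, "StorageClass": "GLACIER"}],
        }]
    }
```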

That can bite you if you have a lot of small files[1], which with automatic archiving of S3 can happen.

[1] https://therub.org/2015/11/18/glacier-costlier-than-s3-for-s...

Amazon makes it very easy to transition data from S3 to Glacier. But, if you want to get it back, you will still have to go through all of the byzantine Glacier rules and fees, including the 4-hour wait.

And, there is no "transition from Glacier to S3": if you want to do that, you have to:

1. restore it to S3 (and incur the fees and 4-hour wait)

2. copy the restored S3 object to a new S3 object

3. delete the restored object (or wait for it to time out)

If downloading more than 5% of stored data is so expensive, wouldn't it have been cheaper to upload a file 19 times the size of the stored data (containing /dev/urandom)? After that, downloading just 5% of total data would have been free.

It still wouldn't have been free. The free download allowance is spread daily across the entire month. That is, you can download 5% of your data per month, and only 0.16% of it per day. For your optimization to work, you'd have to retrieve your data over 30 days for it to be free.

At that point, your monthly fee would be greater than that of a regular S3 bucket, which has fewer barriers to retrieval.

Reminds me of my advice to Netflix after their peering problems emerged: just push garbage upstream from every client to equalize your peering traffic.

I've had some big Glacier bills in the past, even the upload pricing has gotchas[1]

These days the infrequent access storage method is probably better for most people. It is about 50% more than Glacier (but still 40% of normal S3 cost) but is a lot closer in pricing structure to standard S3.

Only use glacier if you spend a lot of time working out your numbers and are really sure your use case won't change.

[1] - 5 cents per 1000 requests adds up with a lot of little files.

I use Infrequent Access Storage for backups, through a tool called duplicity (or more aptly, a GUI front-end for that tool called Deja Dup). Instead of uploading every individual file, it gathers them into 25 MB .tar files and uploads those along with an index describing where each actual file is. That has made the request costs negligible for me.
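The bundling trick is easy to approximate yourself. Here's a toy, in-memory sketch of the idea — nothing like duplicity's actual format, and the 25 MB volume size is just the default mentioned above:

```python
import io
import tarfile

def bundle(files, volume_bytes=25 * 1024 * 1024):
    """Pack many small files into ~25 MB tar volumes, so per-request
    fees apply to a handful of archives instead of thousands of files.
    `files` is an iterable of (name, bytes) pairs."""
    volumes = []
    buf = io.BytesIO()
    tar = tarfile.open(fileobj=buf, mode="w")
    for name, data in files:
        info = tarfile.TarInfo(name=name)
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
        if buf.tell() >= volume_bytes:      # roll over to a new volume
            tar.close()
            volumes.append(buf.getvalue())
            buf = io.BytesIO()
            tar = tarfile.open(fileobj=buf, mode="w")
    tar.close()
    volumes.append(buf.getvalue())
    return volumes
```

A real tool would also keep the index mapping each file to its volume, which is what makes single-file restores cheap.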

We combined the files too.

And then a customer wanted all their files rather than just one or two. Although that was billed back to them.

Pricing should always be made straightforward and easy to understand, and that pricing plan is dodgy as hell.

Pricing plans FOR CONSUMERS should always be straightforward and easy to understand. This is not a consumer product.

Pricing plans FOR B2B should model, as effectively as possible, the underlying costs -- this allows the provider to offer the lowest possible pricing for the services that cost them the least to provide, with expensive services priced accordingly. As others have mentioned on this thread, utilities are really, really good at this -- they come up with extremely complex rate plans for their largest customers that help them achieve whatever economies they are aiming for, for example incentivizing customers to provide level-loading (which is effectively what Amazon is doing in this retrieval scheme).

> This is not a consumer product.

That's a manufactured excuse for a fundamentally bad product api and pricing structure. When dropbox is more useful than AWS, amazon has screwed the pooch (which they do pretty often). Segmenting users by arbitrary circuitous logic into "consumers" (can't find a good use for it) and "enterprise" (can find a good use for it) isn't constructive. Both classes should avoid it, because it's not even an inexpensive choice, for what you get.

How is Dropbox more useful? Dropbox and AWS Glacier are vastly different products and show exactly the divide between consumer and enterprise that you say doesn't exist.

I think the reality is that most cloud customers are approximately consumers. Some big/sophisticated customers may want precise knobs to tune their spend versus "capability" but many businesses just want you to store their data. Pricing models like Glacier's are scary as hell to any CFO, unless they're shown a plan that says "So we're going to save $XXM on this, 100% certain".

Adding on your utility company meme, there are different rate schedules for residential and business customers in most places. In Cloud, I think negotiated contracts with advanced customers probably make more sense than complex, unfriendly pricing models for everyone.

Big disclaimer: I work on Compute Engine.

Seems like "precise prediction and execution of Amazon Glacier operations" might be a niche product people would pay for (and probably already exists for enterprise use cases?)

That's something that generally keeps me from using AWS and many other cloud services in many cases: the inability to enforce cost limits. For private/side project use I can live with losing performance/uptime due to a cost breaker kicking in. I can't live with accidentally generating massive bills without knowingly raising a limit.

I concur.

I have not tried a variety of AWS services because I have no idea what they would cost me if something went haywire on my server.

If I could simply deposit a prepaid amount with Amazon, it would just draw down this deposit until it's depleted, after which my services would grind to a halt. That would be a perfect way for me to try it.

You can do it pretty easily with the AWS APIs, and in the process scram-switch only the stuff you really want to kill.

How would you script protection against the issue described in the article? Unless you check the cost before every single request you can't stop it (and if you accidentally send one massive retrieval request, even that isn't enough).

For other services it is easier, but even then, setting up and managing my own cost control mechanism is a level of complexity (and risk of failure) I'd really want to avoid, esp. since I probably use AWS to avoid management overhead.

You can't in this particular use case, but I can't envisage AWS providing a cost-control system that would stop this, either. It doesn't make sense--they're not calculating costs as they go. What I'm saying is that what Amazon would provide you is not functionally different from what you can do yourself right now.

I would be a lot more worried about a risk of over-charging myself if AWS wasn't incredibly good about refunding accidental overages.

as user re pointed out above, apparently for Glacier that functionality actually exists: https://docs.aws.amazon.com/amazonglacier/latest/dev/data-re...
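For reference, the policy those docs describe is small. Here's a sketch that just builds the policy document — you'd apply it with Glacier's SetDataRetrievalPolicy API, and note I'm going from memory on the exact field names:

```python
def free_tier_only():
    """Cap retrievals at the free tier: requests beyond it are rejected
    rather than billed."""
    return {"Rules": [{"Strategy": "FreeTier"}]}

def bytes_per_hour_cap(max_bytes_per_hour):
    """Cap the account-wide retrieval rate at a fixed bytes/hour budget,
    which directly bounds the peak-hour term in the billing formula."""
    return {"Rules": [{"Strategy": "BytesPerHour",
                       "BytesPerHour": max_bytes_per_hour}]}
```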

Huh, that's...surprising. Glad it exists, though, as Glacier definitely is opaque. Thanks for pointing that post out (to you and to `re).

"The problem turned out to be a bug in AWS official Python SDK, boto."

My only experience of using boto was not good. Between point versions they would move the API all over the place, and this being Amazon, some requests take ages to complete.

After that I worked with the Google APIs, which were better, but still not what I'd describe as fantastic (hopefully things have improved over the last 2 years).

Wouldn't it be better for the OP to simply upload an extra 19 × 60GB (≈1.2TB total) of random data, wait a month (paying less than 20 USD), and then download the initial 60GB within his 5% monthly limit?

No, because he can download the 60GB at a slower, cheaper rate, paying less than the storage for all that padding and getting it sooner than a month.


By requesting the CDs one after another (or small numbers in parallel), keeping the data rate low.

This article claims that glacier uses custom low-RPM hard disks, kept offline, to store data.

Does s/he substantiate this claim in any way? AFAIK glacier's precise functioning is a trade secret and has never been publicly confirmed.

Oh, there's also a Hacker News thread about this


Considering the fact that bugs in the official APIs resulted in multiple retry attempts, he should demand some of his money back.

The retries don't cost extra. Sending off all 150 retrieval requests for all the data at once set the retrieval rate and the price, they should have been staggered over a lot of time to keep the rate low.

How official is boto?

They should rename this service to Amazon Iceberg

About a year ago NetApp bought Riverbed's old SteelStore (nee Whitewater) product -- it's an enterprise-grade front-end to using Glacier (and other nearline storage systems). It provides a nice cached index via a web GUI that lets you queue up restores in a fairly painless way. It even had smarts in there to let you throttle your restores to stay under the magical 5% free retrieval quota. It's not a cheap product, and obviously overkill for a one-off throw of 60GB of non-critical data ... but point being there are some good interfaces to Glacier, and roll-your-own shell scripts probably aren't.

As noted by others here, if you treat glacier as a restore-of-absolute-last-resort, you'll have a happier time of it.

Perhaps I'm being churlish, but I railed at a few things in this article:

If you're concerned about music quality / longevity / (future) portability - why convert your audio collection to AAC?

Assuming ~650MB per CD, and the 150 CD's quoted, and ~50% reduction using FLAC, I get just shy of 50GB total storage requirements -- compared to the 63GB 'apple lossless' quoted. (Again, why the appeal of proprietary formats for long term storage and future re-encoding?)

I know 2012 was an awfully long time ago, but were external magnetic disks really that onerous back then, in terms of price and management of redundant copies? How was the OP's other critical data being stored (presumably not on Glacier)? E.g. my photo collection has been larger than 60GB since way before 2012.

Why not just keep the box of CD's in the garage / under the bed / in the attic? SPOF, understood. But world+dog is ditching their physical CD's, so replacements are now easy and inexpensive to re-acquire.

If you can't tell the difference between high-quality audio and originals now - why would you think your hearing is going to improve over the next decade such that you can discern a difference?

And if you're going to buy a service, why forego exploring and understanding the costs of using same?

> Assuming ~650MB per CD, and the 150 CD's quoted, and ~50% reduction using FLAC, I get just shy of 50GB total storage requirements -- compared to the 63GB 'apple lossless' quoted. (Again, why the appeal of proprietary formats for long term storage and future re-encoding?)

I did a comparison between FLAC and ALAC (a.k.a. Apple Lossless) on my CD library a few years ago (plus a few 48kHz tracks taken from DVDs), and the difference in total filesize was less than 10% so I doubt that is a major factor. I personally went for ALAC, as it has equal (EAC, VLC) or better support (OS X Finder, iTunes, Windows Explorer, Windows 10 media player, some tagging scripts, iOS) in stuff I currently use. Providing I keep a decoder with the files, its proprietary nature doesn't really bother me - I can always convert to xLAC if desired.

Interesting. My sense was that FLAC was much more widely supported - in terms of my specific cases the Amazon player supports FLAC but I don't think it supports ALAC, and I'd worry about being able to play them on my Android phone.

Interesting, thanks.

I wouldn't use a proprietary format because I could never be sure when in the future I'd want to read / re-encode, or what type of systems I'd have available at that time, other than knowing I'd always have access to free software.

I have some FLAC archives, but I don't use them - so support to play that format hasn't been something I've taken much notice of. Do you normally play your ALACs, or keep an mp3 / ogg / aac version around to actually listen to?

It's not exactly proprietary since Apple open sourced the ALAC code in 2011.


Also, someone has wrapped it to build with different tool chains.


I normally play the ALACs directly (most of my listening is done on my PC, streaming from my NAS). I used to keep an AAC version around for my mobile devices, but never really used it so don't bother any more.

Apple open-sourced ALAC and made it royalty-free in 2011, it's no longer a proprietary format.

Does anyone have a success story for this type of backup and retrieval on another service?

I briefly used Glacier for daily backups as a failsafe if our internal tape backups failed when we needed them. The 4 hour inventory retrieval when I went to test the strategy, and the bizarre pricing, quickly made me look at other options.

I have a strong feeling that he would get a refund if he contacted Amazon support, considering it was caused by a bug in the official SDK and he didn't ACTUALLY use the capacity he's being asked to pay for.

Multiple requests didn't cause the issue. It was asking for all of his data to be queued in such a small window of time. The same thing would have happened without the bug.

That said, this is Amazon, who will refund you for a product if you ship it to the wrong address accidentally...I'm sure OP could get a refund.

I read it as the SDK had a bug, but that wasn't what caused the costs.

This is why I break my large files uploaded to Glacier into 100MB chunks before uploading. If I ever need them, I have the option of getting them in a slow trickle.

This is no longer necessary, they now allow you to specify ranges of partial files. So you can split a single large file into multiple requests to keep you within allowance or within budget.
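A sketch of how you might generate those ranged requests (Glacier wants ranges that are megabyte-aligned, if I remember right, which holds here as long as the chunk size is a whole number of megabytes):

```python
def retrieval_ranges(archive_size, chunk=100 * 1024 * 1024):
    """Split one large archive into Range headers for a series of
    staggered retrieval jobs, keeping the peak hourly rate low."""
    ranges = []
    start = 0
    while start < archive_size:
        end = min(start + chunk, archive_size) - 1
        ranges.append("bytes=%d-%d" % (start, end))
        start = end + 1
    return ranges
```

Submitting these one at a time, spaced out over hours or days, is what keeps the peak-hour billing term small.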

Nice. I also upload par2 checksum files for each chunk so changing things now would involve a bit of script rewriting, but that's good to know for the future.

Is that necessary? They report a checksum for every file when you request an inventory.

No, it's not necessary at all, I hope. I probably should have called them just "Par2" files not Par2 checksum files. They aren't just checksums. They allow reconstruction of the original file even after fairly extensive damage. Which isn't likely, so, yes, not strictly necessary, but who knows, could be useful in the what-if scenario. The checksums you mention only help with checking, not with repair.


Perhaps a naive question, but why would Glacier try to discourage bulk retrieval? Is it because the data is fragmented physically?

We don't actually know how Glacier data is stored (though there are several theories, ranging from regular S3 to tape robots to experimental optical media).

I suppose part of the neat trick of it is that because we don't have to know, Amazon can switch it out for something else anytime it's convenient for them, or some new tech comes up. Or split thing up among several methods and compare costs. As long as they structure their operations such that nothing is ever lost and everything can be retrieved with a few hours notice at any time, they can try anything they want.

How would us knowing prevent them from doing this?

Whatever information became public, people would write scripts and the like with those assumptions baked in. Even if it wasn't officially documented, it'd be bad PR for Amazon to break things people were relying on.

I'm sure that was a commercial decision, not a technical one.

They could be running Glacier storage at cost (or even a slight loss).

But they make their profit when you try to get your data out.

I'm not implying anything nefarious - more along the lines that Amazon (could have!) looked at the market and compared demand to what they were offering. Then found something that scratched an itch...

In most of these cases, the technical decisions impact the commercial. If writing megabytes were "technically" cheaper than reading, you can either choose to pass that cost onto customers or simplify the pricing model assuming some read-to-write ratio.

They didn't just set the prices (or the pricing model) to amuse themselves.

Disclaimer: I work on Compute Engine.

Because they power down cold racks. It's called Glacier for a reason.

So they have to power them back up. But why would they care whether you then download 1gb or 1tb?

Because that's how they getcha. :-)

For cheap storage there is also Oracle Archive Storage with 0.1c/GB ($0.001/GB). They have horrible cloud management system though.


> I’d need more than one drive, preferably not using HFS+, and a maintenance regimen to keep them in working order.

I'm really doubting the need for a maintenance regimen on a drive which is almost entirely unused. Could have spent $50 on a magnetic-disk-drive and saved yourself hours worth of trouble.

The problem with physical drives is that they can (and do!) fail. The author's point is surely that the backup would need to be checked periodically, and drive failures dealt with.

Does magnetic media like this (especially spinning disk) suffer from bit-rot? What about the possibility of mechanical failure?

I'd never rely on mechanical disks as the one and only backup of any data critical to me - a two tier approach of mechanical for fast retrieval, and cloud/online backup seems to be the safest bet.

You absolutely need a maintenance regimen if you want to be able to reliably retrieve files 4 years later. If you just unplug the drive and stick it in your garage, sure it'll probably work, but I'd say there's at least a 5% chance of failure.

For unique data you want super robust storage options, both local and remote. But for something as generic as ripped CDs? Why bother? Just use an external drive or two if you are super worried about one dying. Even if you lose both drives the data on them isn't impossible to replace.

Wow, thanks for this!

I currently have 100gb of photos on Glacier. I am going to be finding another hosting provider now.

So depending on how the "average monthly storage" is computed you could get 20x more data in one month and then retrieve the 5% (previously 100%) that you care about for free, and then delete the additional data?

There's a 90 day minimum storage cost for data.

You will ALWAYS pay more than you expect when you use AWS (and probably other cloud services). This case is quite extreme, but costs are assigned in a way complicated enough that it's easy to miss something at some point...

I was looking at Glacier for my backups, but it seemed too complicated ... glad I didn't use it.

I ended up using some cheap VPSes, two of them located in two different countries. And it's still cheaper than, say, Dropbox.

Curious: if you use a "general storage provider" (like glacier) for backup, rather than a "pure backup provider" (like Backblaze, CrashPlan) why is that?

I control the encryption algorithms, the compression scheme, the file chunking strategy, the encryption keys, the encryption of archive/file names, the file naming scheme, the addition of Par2 files, everything.

And I don't pay the overhead of an add-on service.

Also I back up stuff that's not on my hard drive (only on external USB drives) and I'm not sure how the services handle that.

If the services give me some of these points, that's not sufficient; they would have to give me all of these points. Only then would I consider them. All things being equal I'd be willing to pay for some convenience but my current solution is all scripted so it's pretty darn convenient.
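As a flavor of what "all scripted" can mean, here's a toy version of such a pipeline: compress, chunk, and record a checksum per chunk. Real encryption (gpg or similar) and Par2 generation would be additional stages, and the naming scheme is made up:

```python
import gzip
import hashlib

def prepare_chunks(data, chunk=100 * 1024 * 1024):
    """Compress a backup blob, split it into fixed-size chunks, and
    record a SHA-256 digest for each so integrity can be verified
    before attempting a restore."""
    blob = gzip.compress(data)
    chunks = []
    for i in range(0, len(blob), chunk):
        piece = blob[i:i + chunk]
        chunks.append({
            "name": "backup.%04d.gz.part" % (i // chunk),
            "sha256": hashlib.sha256(piece).hexdigest(),
            "data": piece,
        })
    return chunks
```

Restoring is the reverse: verify each chunk's digest, concatenate, decompress.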

> I control the encryption algorithms, the compression scheme, the file chunking strategy, the encryption keys, the encryption of archive/file names, the file naming scheme, the addition of Par2 files, everything.

I can see why detailed control would be one reason, but you could still just have a very controlled backup to your own storage location(s) as a first step and just let a backup service bulk store your already named and encrypted files? It's only the last-resort you need to go to so if it's a huge blob of encrypted data that shouldn't matter too much -- you only need to access that in case of a total disaster where you lost all your own backup endpoints first.

> And I don't pay the overhead of an add-on service.

The reason I'm asking is because I was under the impression that backup services are much cheaper than pure storage, while still offering some conveniences such as versioning/backup apps. Glacier charges $0.007 per GB per month, that's $7/month just for a single 1TB machine, just for a single version (If my math is correct, it's early)! If you have dozens of versions it quickly adds up.

I do 10 machines at around 1TB on average, unlimited storage in unlimited versions, at $1.25 per machine per month (flat rate, regardless of storage volume). I have tried building my own machines, tried looking at storage providers etc., but can't get near.

Even if I did only 1-2 machines, the cost in Glacier would break the backup service cost already at a couple of TB total storage.
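The break-even the comment above describes can be sketched. The $0.007/GB Glacier storage price and the $1.25/machine flat rate are the figures quoted in the thread; everything else here is illustrative, and retrieval fees (billed separately on Glacier) would only widen the gap:

```python
glacier_per_gb_month = 0.007     # Glacier storage price quoted above; retrieval billed separately
flat_rate_per_machine = 1.25     # CrashPlan-style "unlimited" per-machine price quoted above

def glacier_monthly(tb_stored):
    """Monthly Glacier storage cost in dollars for a given volume in TB."""
    return tb_stored * 1000 * glacier_per_gb_month

# A single machine with 1 TB: Glacier storage alone is ~5.6x the flat rate.
print(round(glacier_monthly(1.0), 2))                        # -> 7.0
print(round(glacier_monthly(1.0) / flat_rate_per_machine, 2))  # -> 5.6
```

Versioned backups multiply the stored volume, so the per-GB model falls further behind as version history grows.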

Great analysis, but why "per machine"?

I back up data, not machines. I would prefer that third party backup software not have access to the unencrypted files on my machines.

The pricing you mention is compelling. Which service is that? Does your model work if the data is on multiple external drives?

Getting down to a specific example, say I have one laptop and three external drives. Do the backup services you have in mind work with this setup? How would they charge?

The service I use is the CrashPlan "family" plan. If I would just gather all the data to one machine instead it would work too (could just keep it on a file server with a directory per machine or whatever). The only reason "machines" are useful is for having separate easy restores -- basically my parents or mother in law can restore their machines from the other side of the country.

As long as you just keep the external drives connected to the same machine during backup, the backup application doesn't care where the file set to back up is mounted. It's just a directory list per machine.

I used to back up "data" too, with a first step of backing up multiple machines to a NAS and then backing up only that data to the cloud. However, I sacrificed that extra security for the added convenience of direct restore of individual files and machines. It also reduced the risk of my having made a mistake in the backup config of the first step (which I estimated to be a far higher risk than hardware failure, fire or a data breach at the cloud provider, since I have basically never configured anything right). Being able to easily fetch an individual file from 1 day or 1 week ago can really save time. Edit: also remember that the backup client on Linux uses RAM proportional to the backup size, so on my cheapo NAS I outgrew its RAM and would have had to get a faster one or a file server; that was also partly why I left it.

The good thing about the backup pricing is that it doesn't increase with aggregated backups like normal storage. It's 150/year so to be competitive you need a few TB, but you can very easily backup your parents machines too and save some time around Christmas... Note though that people sharing the same plan can see each others data (or so I presume)

So ... why not just upload an additional 60GB / 0.05 and then download the entire 60GB, which is now 5% of the total storage, for free?

Does anyone have a backup script for Backblaze, or a similar Windows app like SimpleGlacier Uploader?

So, what's the most cost effective way to download all your files from Glacier then?

Spread the download over time. Over 20 months would be free. Over 1 month would be the quoted per-GB charge based on your peak download rate.
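Under the legacy Glacier pricing in effect at the time, the free allowance was roughly 5% of your average monthly storage. The "free over 20 months" figure above follows directly; a small illustrative sketch (the 5% figure is from the old pricing terms, the rest is just arithmetic):

```python
# Hedged sketch: how long a "free" Glacier retrieval would take under the
# legacy 5%-of-storage-per-month free allowance (illustrative only; the
# real allowance was prorated daily).
stored_gb = 60.0        # total data in the vault, as in the article
free_fraction = 0.05    # legacy free allowance: 5% of average monthly storage

free_gb_per_month = stored_gb * free_fraction   # ~3 GB/month retrievable free
months_to_retrieve_free = stored_gb / free_gb_per_month

print(round(months_to_retrieve_free))  # -> 20, matching the comment above
```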

If there's a bug in Amazon's libraries, can't you ask for a refund?

Once he was charged the $150, it didn't cost him any more to try again to download it, because of how the pricing works. So, he would have been charged that much no matter what if he'd downloaded the data at that speed, and there was nothing to refund once he successfully got his data out.

I don't get Glacier. It's painfully slow, painful to use and insanely expensive. https://hubic.com/en is $5/month for 10 TB, with unmetered bandwidth. A far better option for backups.

I'm interested in Hubic, but where do you read that bandwidth is unmetered?

Impressive that Amazon can choose to serve a request at 2x the bandwidth you need, with no advance notice, and charge you double the price for the privilege.

Can you explain what you mean here? I'm fairly certain it isn't true.

This is a simple case of spending more than you should have because you didn't understand the service you were using. It's impacted a little worse by how silly the whole endeavor is, given the preponderance of music streaming services.

Uh no, actually this is really a case of a bug in OFFICIAL SDK that caused a higher bill than expected.

No, requesting all his data at once alone determined the rate. The retries didn't cost extra.

I'm surprised that the author had 150GB of Creative Commons audio CDs to begin with!

Don't assume your local laws apply everywhere in the world. (Hint: they probably don't in Finland, which allows private copies)

Are you trying to say that the OP was not allowed to back up their own CDs? I have several boxes of space wasting CDs of my own that have all been ripped in lossless, and feel somewhat insulted by that notion. Note: I live in Australia, a somewhat less insane country for DRM than the US (AFAIK).

CDs don't even have DRM in the general case. However, local laws can still forbid ripping (e.g. here in the UK where after a brief period of legality it is once again illegal to rip a CD).

EDIT removed material that might have derailed things further.

Roger that; I used the DRM term incorrectly, meaning copyright law. Most of my collection was bought in Norway, but I did the ripping a decade ago, give or take a few years, in Australia. For the time being we have much better consumer protection here.

I don't even use my collection now that I have Spotify Premium. The only music I've bought lately is some 24bit high bitrate stuff.

Also in Australia, and it's not that black-and-white in my understanding:

This is known as "Format Shifting" — taking one copyrighted medium and converting it to another. In Australia, you are explicitly not allowed to do this with CDs, DVDs and Blu-rays.

You are only allowed to keep a digital copy if you continue to retain the original — a backup. If the original is lost or destroyed, your digital copies must be discarded.

For example, you can rip a CD and put it on your iPod, or computer, as long as you continue to own the CD. The issue here is that in both cases you also control the device you are copying it to. You don't control but rather lease space on Amazon's servers — so it introduces a grey area on whether you are allowed to backup to such places and whether putting data on those servers constitutes distribution of the copyright material.

Realistically, none of this is black-and-white, and Amazon could flag it as infringing content and remove it just to cover themselves against DMCA complaints anyway. This is true in both Australia and the US, regardless of the differences in copyright law (Australian copyright law offers far fewer protections than US law, incidentally), because both have similar DMCA-style laws.

Saw this too late to edit my other reply; an interesting overview of what you may or may not do:


Also backed up by: http://copyright.com.au/about-copyright/exceptions/ - which states we're allowed to "space shift" music.

Interesting, thanks for clarifying. Good thing I still have the originals then, as much as I'd like to dump them all.

Two paragraphs in, it's plain to see he's not talking about backing up his CDs; he's talking about full copies so he can get rid of them.

And then posting this fact to hacker news. Not the brightest bulb in the pack.

I am picturing the following use case:

1) Buys music CD.

2) Rips CD to own computer. Shares with own devices [via personal cloud] for continued listening.

3) Destroys original CD.

4) Continues to listen to music which has been paid for; not sharing files with anybody else.

I for one fail to see the problem here. But then again, I'm probably not the brightest bulb in the pack either.

I have yet to meet anyone who destroys their originals. Do you? Most people sell or gift them to someone else, which is outside not only the letter of current copyright law but the spirit of it as well.

It would be easier to download pirated copies off the internet and exactly as legal.

Mine are all in a box shoved against the wall under my bed, where they don't get in the way but I can prove that I still deserve consideration under the First Sale doctrine.

What country are you thinking of? Backing up CDs is legal in the US.

I don't like how the title and article read like a hit piece on Amazon Glacier. It's great at what it is intended for. In addition, it seems he still saved money, because over 3 years the $9-a-month savings added up to more than the $150 bill for retrieval.

I'm surprised that this aspect has not been mentioned here in the comments yet:

> I was initiating the same 150 retrievals, over and over again, in the same order.

This was the actual problem that resulted in the large cost.

At my old job we would get a lot of complaints about overage charges based on usage of our paid API. The pricing wasn't as complicated as a lot of AWS services', just x req/month and $0.0x per req after that, but every billing cycle someone would complain that we had overcharged them. We would then look through our logs to confirm they had indeed made the requests and provide the client with these logs.

> > I was initiating the same 150 retrievals, over and over again, in the same order.

This was the actual problem that resulted in the large cost.

Except that it wasn't. The repeated requests were free, because he already set the maximum rate with the first wave of requests. Surprising?

Also, it really is not a hit piece. It's an honest report of what he did (and what he did wrong) and that he thinks the docs aren't as clear as they could be.

Thanks, I guess I glossed over that part of the article but when I got to the end and saw the original quote I assumed the worst. The title alone is pretty inflammatory in my opinion.

> I was initiating the same 150 retrievals, over and over again, in the same order.

>> This was the actual problem that resulted in the large cost.

That is not true, according to the article itself. The first request for the full 60.8GB already results in a $154.25 bill, regardless of the ones that follow. From that point on he can continue to retrieve 15.2GB/hour for the rest of the month without incurring further costs.
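The reason the retries were free is that the legacy retrieval fee was driven by your peak hourly retrieval rate for the month, not the total volume moved. A hedged sketch of that model: the $0.01/GB rate and 720-hour month are assumed round numbers from the old pricing FAQ, and actual regional rates and proration differed, so this will not reproduce the article's exact bill:

```python
def legacy_glacier_fee(peak_gb_per_hour, free_gb_per_hour=0.0,
                       rate_per_gb=0.01, hours_in_month=720):
    """Approximate legacy Glacier retrieval fee: you pay for your billable
    peak hourly rate as if it were sustained for the whole month."""
    billable_peak = max(0.0, peak_gb_per_hour - free_gb_per_hour)
    return billable_peak * rate_per_gb * hours_in_month

# Requesting 60.8 GB at once, treated as retrievable over 4 hours,
# sets a ~15.2 GB/hr peak. Repeating the same retrievals never raises
# the peak, so the retries add nothing to the bill.
print(round(legacy_glacier_fee(15.2), 2))
```

Under this model, the only lever is the peak: spreading requests out lowers it, while re-requesting at the same rate leaves it unchanged.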

Thanks for clarifying, I guess I missed that part.
