I'm a long-time user of Backblaze, and I'm a big fan of the product - it does a great job of making sure my working documents are always backed up, particularly when I'm traveling overseas and my laptop is more vulnerable to theft or damage.
With that said - Backblaze is optimized for working documents, and the default "exclusion" list makes it clear they don't want to be backing up your "wab~,vmc,vhd,vo1,vo2,vsv,vud,vmdk,vmsn,vmsd,hdd,vdi,vmwarevm,nvram,vmx,vmem,iso,dmg,sparseimage,sys,cab,exe,msi,dll,dl_,wim,ost,o,log,m4v" files. They also don't want to back up your /applications, /library, /etc, and similar locations, and they make it clear that backing up a NAS is not the target case for their service.
I can live with that - because, honestly, it's $4/month, and my goal is to keep my working files backed up. For system image backups, I've been using SuperDuper with a $50 external hard drive.
Glacier + a product like http://www.haystacksoftware.com/arq/ means I get the best of both worlds - Amazon will be fine with me dropping my entire 256-gigabyte drive onto Glacier (total cost: $2.56/month) and I get the benefit of off-site backup.
The world is about to get a whole lot simpler (and inexpensive) for backups.
In an earlier version of Arq, there was no way to see which data was actually selected for backup (and which was not). Has Arq become more user-friendly?
I'm a current user of Arq, and in my opinion it has (compared to Backblaze) a phenomenally user-friendly view of exactly what is being backed up, when, and when something was added or modified.
I think the key here is not to just provide a toggle for using Glacier instead of S3, but to have the historical snapshots migrated to Glacier from S3 and deleted every 90+ days.
Unless your files change a lot, keeping the latest backup version in S3 and previous versions in Glacier would mean that most of your backup data are still in S3 I think. Right?
Just FYI, I too would view Glacier support as something worth a normal upgrade fee. I am using Arq more as a long-term back-it-up-and-forget-about-it storage solution anyway, so it seems like a natural direction to go for a user like me (same strategy, but even lower cost).
That said, I am reminded that I should not forget about the Arq backups and do a few test restores sometime. :)
Wow, yes, please do! I've never used Arq, and at the S3 prices it doesn't make sense on top of my Dropbox. But with Glacier, I'd love to be able to back up my entire HDD off-site for $5/month, and Arq looks like a good way to do that.
I'm a happy Arq customer today and I'd pay for an upgrade to Glacier! Longer term I'm sure you will have competitors building on Glacier, so best to move sooner rather than later.
With large volumes the real issue is not the storage but the upload speed. I've done some experiments, and on my Comcast link (20 Mbps down / whatever up) I got about a 1 GB/hour upload rate. So it'll take 11 full days to upload 256 GB. Or, more realistically, if you do it overnight (8 hours a night) - the entire month.
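For anyone who wants to plug in their own numbers, here's that back-of-the-envelope math as a quick Python sketch (the 1 GB/hour rate and 256 GB size are just the figures from the comment above; substitute your own):

    # Rough upload-time estimate for an initial backup.
    # Figures are the ones quoted above; adjust for your own link and data set.
    data_gb = 256          # total data to upload, in GB
    rate_gb_per_hour = 1   # observed upload rate on the Comcast link above
    hours_per_night = 8    # uploading overnight only

    total_hours = data_gb / rate_gb_per_hour
    print(f"Continuous upload: {total_hours / 24:.1f} days")                 # ~10.7 days
    print(f"Overnight only:    {total_hours / hours_per_night:.0f} nights")  # ~32 nights, about a month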
"Never underestimate the bandwidth of a station wagon full of tapes hurtling down the highway. —Tanenbaum, Andrew S." - http://en.wikipedia.org/wiki/Sneakernet
Makes me wonder what Amazon is doing to commoditize its complements, i.e. what it's doing to improve high speed internet access. There are a lot of people in my vicinity who have no option faster than a wireless 3Mbps connection capped at a few GB per day. And this is within easy distance of Amazon's East Coast data center.
What's even more important, you will be able to encrypt your backups without having to disclose the encryption key in case you ever need to restore (client-side encryption and decryption). This is not the case with Backblaze, which is why I switched to CrashPlan — but I'm still looking for other solutions.
You'd prefer that they lie to you and pretend it's not stored in the clear? If you don't have to type it in every time a backup runs, it's in the clear - everything else is just window dressing.
Right, which is a largely meaningless distinction: if the process runs as you, any successful attacker can simply read it out of the CrashPlan process in memory.
Again, if you're not typing the password in every time, a local compromise is almost certainly game over. Apple's keychain helps reduce the damage if the data's not actively being used, but for something like CrashPlan, which is always running, the attacker is probably going to get lucky.
I agree that showing the encryption key in the clear is not a serious security flaw, but it's IMHO against best practice.
Passwords and keys are usually shown in an obscured form, typically with asterisks, and stored in the user or system keychain. You are absolutely right that the security value of these standard practices should not be overvalued, but still … what else within the security framework of CrashPlan is not done in accordance with best practice?
(I assume, BTW, that CrashPlan does not use the system or user keychain on the Mac because it is not a real Mac citizen but a Java-based app. Firefox and Wuala – the latter Java-based too – don't use the user or system keychain either.)
Those arguments are typically made from the wrong perspective. It's just so common that if you don't do it, your product will be perceived as insecure. And as long as actual security is not horrible, the perception of security is what drives sales, not the actual security.
CrashPlan defaults to using the account password as the encryption password.
However, you can also secure the encryption with a password not associated with the account, or even provide your own 448-bit key. If you use either of these options, CrashPlan support will not be able to help you.
This setup allows CrashPlan to easily help non-technical home users, while allowing technically savvy users to securely hang themselves with their own encryption.
My question is not about the setup, that's OK. I am wondering why CrashPlan shows the encryption key in the clear and does not store it in the user or system keychain.
As others have noted, it is partly because CrashPlan is a Java-based app. It is also partly because CrashPlan runs as System, not as the user. That way, I can have CrashPlan back up both my wife's user account and my own.
Furthermore, you can use the encryption key + a custom password, or your own encryption key with a passphrase. In this case, it is encrypted locally and the key is not sent to CrashPlan[1].
I'm assuming this is because their desktop client is a Java app and it's been made to run on Linux, Mac, and Windows. Is there such a thing as a keychain on Windows? I've only used one on Linux and Mac.
Then no, it does not worry me. Obscuring password entry in an application that I almost never run is not a problem. Plus, if I do actually run the app and want to enter the password, I don't do it with a bunch of people peering over my shoulder.
The other feature that's great about CrashPlan is that it will allow you to back up to other systems running their software. I back up my laptop to both their service and one of my dedicated servers. That way, if anything ever happens to their online service, I have a second remote copy that I control.
Backblaze does support the optional use of a separate encryption key, which they claim never to store, but there's no information about whether it is ever disclosed to them. There's little detail on their website, but it can be set under Settings > Security in the Backblaze client.
Home use is probably the only situation where Glacier is good for backup, though.
A home user is fine with a 3.5-4 hour window before their backup becomes available for download (as it will probably take them days to download it anyway).
In a corporate environment, I don't want to wait around for 3.5-4 hours before my data even becomes available for restore in a disaster recovery situation.
Seems good for archive-only in a corporate environment (as the name implies).
Interesting to note - for those of us who used Iron Mountain/Data Safe for years, 4 hours was considered a "Premier Rapid Recovery" service that we paid a lot of money for (as recently as 2003, actually).
In a true disaster recovery (building burned down or otherwise unavailable), it usually takes most businesses a week or so just to find new office facilities.
But - agreed, there will be some customers for whom Glacier wouldn't work well for all use cases.
Now - a blended S3/Glacier offering might be very attractive.
Yes, that's what I am thinking: a week of backups to S3 and a script that moves the oldest one from S3 to Glacier. You would rarely need a backup older than a week if you've already got a week of backups. I'm still figuring out if there is a practical way to do this with incremental backups without introducing too much risk that the backup process gets messed up.
"In the coming months, Amazon Simple Storage Service (Amazon S3) plans to introduce an option that will allow you to seamlessly move data between Amazon S3 and Amazon Glacier using data lifecycle policies."
It's made for archiving, not 'backups'. You're expected to keep your most recent backups on-site, so if a RAID array dies or you happen to do something wrong, it just takes minutes to restore. Off-site, or tape archive, is usually another thing. Most of the data is never, ever accessed (but required by company policy to be kept).
In a corporate environment, I wouldn't want to depend on the cloud as my primary backup solution in the first place. I'd be much more comfortable using it as the offsite mirror of an onsite backup. If you're at a point in disaster recovery where you have to restore from your offsite, you (likely) have bigger problems than a 4-hour wait time.
I personally believe that data should never be deleted (or overwritten), but only appended to. Kinda like what redis/datomic does. So, keep live data onsite, along with an onsite (small) backup, and all the old data in Glacier.
You can believe that, but legal realities dictate otherwise. There are certain classes of information that you are not permitted to keep beyond a defined horizon, either temporal or event-based. Legal compliance with records management processes means having the ability to delete or destroy information such that it cannot be recovered. Note that if the information is encrypted, you can just delete the decryption key and it is effectively deleted.
That's just the beginning - if you've ever been in an e-discovery process, having large amounts of historical data is actually a liability. If instead of 100GB you have 10TB, you'll need to hand that over, and before that, to cull it so you don't inadvertently hand the opposition a huge lever to be used against you. Processing and reviewing 10-100x the data can take inordinately more time than you expect.
"In a corporate environment, I don't want to wait around for 3.5-4 hours before my data even becomes available for restore in a disaster recovery situation."
I am guessing you work in a company with a good IT department then; that is not the average. At many companies I have worked for, 4 hours would be a miracle - a 1-3 day operation is the minimum.
And don't forget that the X00MB-type size limits many IT departments put everywhere are there not because a TB hard drive is expensive, but because all of the extra backups add to the cost of each new MB. Having another extremely cheap way to back up large amounts of data (encrypted?) would help reduce the cost of each extra GB.
I don't think DR is the correct use-case. I think it more likely to be (as in the case of banks which I know best) a regulator asking for electronic documents or emails relating to a transaction from 6 years ago (seven years is the retention requirement).
In that case, 3-4 hours would be more than acceptable.
Here, we are producing logs of simulation runs that may be needed in the next couple of years. Most of the logs will be destroyed unused. This is a perfect use case for Glacier. We do not care about a 5-hour recovery time.
In a corporate environment where you have low RTO goals for DR scenarios, relying on backups instead of replicated SAN etc is not a sound practice, especially with large data sets.
I use s3cmd sync to keep my files backed up.
But for my pictures and tunes, I use a Linux server and rsync. Now Glacier is perfect for my tunes and family pictures.
This is a really good offering for media that you typically will keep locally for instant access, yet you want to have an off-site backup in a way that lives for a very long time.
Dropbox should work here, but it's simply too expensive. My photo library is 175GB. That isn't excessive when considering I store the digital negatives and this represents over a decade.
I don't mind not being able to access it for a few hours, I'm thinking disaster recovery of highly sentimental digital memories here.
If my flat burns down, destroying my local copy, and my personal off-site backup (an HDD at an old friend's house) is also destroyed... then nothing would be lost if Amazon have a copy.
In fact I very much doubt anyone I know who isn't a techie actually keeps all of their data backed up even in that manner.
I find myself already wondering: my 12TB NAS, how much of it is used (4TB)... could I back up all of it remotely? It's crazy that this approaches being feasible. It's under 30 GBP per month for Ireland storage for all of the data on my NAS.
To be able to say, "All of my photos are safe for years and it's easy to append to the backup." That would be something.
A service offering a simple consumer interface for this could really do well.
sigh Dropbox should not be used as a backup system. A system that synchronizes live should never get that role, unless they can guarantee that old data is never overwritten and new data is always appended. This is not the case with Dropbox - I've experienced multiple scary occurrences of old versions vanishing into nirvana after certain user actions. In some cases the old data just appears to be gone; in others the web interface shows it but a restore results in an error message. Moving data in particular appears to be buggy. Dropbox appears to be simple, but the backend processes really are not, and there is too much going on for it to be a reliable backup system, especially if you also share some of your data with teammates.
No sigh was necessary, I understand. I even mentioned that I have a NAS and off-site backup... so either you didn't read it all or you stopped the moment you encountered the word "Dropbox" and started typing.
That said, people DO use DropBox as backup.
If you take a walk around the British Library and asked every PhD student working there how they "Backup" their research and work in progress, I bet every single person who believes that they have a backup will say "Dropbox", and the only exceptions will be a few who don't really have a backup.
I know that because I ensured my girlfriend does have a real backup solution in place that is tested. Not one of her peers seems to.
DropBox is used for backup because they've made file sync so damn easy that most people can be convinced that if a file exists in many places, it is backed-up.
My whole point is that now storage for long term backup is priced in a way that is affordable to most, that consumer services may emerge that offer true backup to consumers and can successfully migrate people from lesser solutions (DropBox, stacks of CD-ROMs, etc).
One of the things about backup is that it needs to be easy. Currently the size and cost of backups make it expensive, and the only way to reduce the cost makes it difficult (HDD local copies stored at a friends' house for example).
By reducing the cost, perhaps we can finally increase the ease... and then a day may come in which most people have a real backup solution.
"I even mentioned that I have a NAS and off-site backup... so either you didn't read it all or you stopped the moment you encountered the word "Dropbox" and started typing."
No offense, but this post doesn't quite match what you wrote originally. My sigh was in response to the phrase "Dropbox should work here, ...". You didn't state security as your concern as to why not use Dropbox, rather it was cost. This might lead someone who only needs <2GB backed up to believe that Dropbox is perfectly fine for that task.
"DropBox is used for backup because they've made file sync so damn easy that most people can be convinced that if a file exists in many places, it is backed-up."
And that's exactly what I'm scared about and why I'm saying it again and again that you shouldn't do it - if only one person listens and avoids potential data loss because of it I've already reached my goal.
"One of the things about backup is that it needs to be easy. Currently the size and cost of backups make it expensive, and the only way to reduce the cost makes it difficult (HDD local copies stored at a friends' house for example). By reducing the cost, perhaps we can finally increase the ease... and then a day may come in which most people have a real backup solution."
Agreed. IMO no consumer backup system is quite there yet. TimeMachine is very close, if only it would do better logging and have some more intelligence about warning messages.
"No offense, but this post doesn't quite match what you wrote originally. My sigh was in response to the phrase "Dropbox should work here, ...". You didn't state security as your concern as to why not use Dropbox, rather it was cost. This might lead someone who only needs <2GB backed up to believe that Dropbox is perfectly fine for that task."
A fair point.
In my case, Dropbox use is in addition to local RAID (scratch) + NAS (network scratch, access to larger files) + off-site (backup).
I only use Dropbox for syncing and sharing.
The sync vs backup is an interesting one, simply because most consumers couldn't tell you the difference.
For example: Q: "Are your contacts backed up?". A: "Yes, they're sync'd to Google".
I did conflate my scenario with thinking about my girlfriend's peers in my post. And then reacted from my perspective again... my bad.
"Agreed. IMO no consumer backup system is quite there yet. TimeMachine is very close, if only it would do better logging and have some more intelligence about warning messages"
Vigorous agreement here too, except for the Time Machine bit, as that is Mac-only and doesn't work for <insert any other system or device that isn't Apple Mac OS X>.
"Vigorous agreement here too, except for the TimeMachine bit as that is Mac only and doesn't work for <insert any other system or device that isn't Apple Mac OSX)."
Well yes, obviously it only works in a Mac-only household/office. I still think it's a good solution for non-technical users who only have Macs, since it's so simple that you could literally explain to your grandma how to set it up. I don't think mobile devices are that important, on the other hand. The important data on them should usually be synced to your computer, and as long as that is backed up, you should be fine. On Windows, I think the built-in backup since Windows 7 is finally decent, albeit not yet grandma-proof ;).
To be fair though, Dropbox is way better than any other backup solution consumers usually use. Personally, I've never seen data loss occur on Dropbox, but I'm sure it can happen - it's just way less likely than the average user messing up their own backup.
The ideal solution is the quadfecta (is that a word?): Dropbox for (in my experience) excellent versioning/synchronizing (it has never failed me) + Backblaze (or its ilk) for continuous off-site backups + SuperDuper (weekly/whenever) for image backups + something (Arq?) on top of Glacier for long-term off-site archival.
For $50 in software (Arq + SuperDuper), $100 for an external HD, and less than $25/month ($4 Backblaze, $10 Dropbox, $10 Glacier), you have a backup system that is next to airtight for a terabyte of data and a working set (on Dropbox) of 100 gigabytes.
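If you want to sanity-check the totals (these are just the prices quoted above, not current list prices):

    # Rough cost of the "quadfecta" setup described above; all figures from the comment.
    one_time = {"Arq + SuperDuper": 50, "external HD": 100}
    monthly  = {"Backblaze": 4, "Dropbox": 10, "Glacier (~1 TB)": 10}

    print(f"Up-front:  ${sum(one_time.values())}")   # $150
    print(f"Per month: ${sum(monthly.values())}")    # $24, i.e. under $25/month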
Dropbox stores their stuff in S3 a little differently - it's not a 1:1 correspondence between user files and objects under Dropbox's S3 account. The fact that they use S3 as their backing store means very little. It certainly sounds good to have S3 in back when you talk about scalability and durability, but 1) they could just as easily use something else, and 2) depending on their sharding strategy, a single lost object could impact multiple files at the user level.
Right - the point I was trying to make is that he was putting all his eggs in one basket. If anything catastrophic happened to S3, he might lose both his S3 and his Dropbox backups.
If you are going to the effort of having dual-backup systems, may as well try and find something that can't be impacted by a single disaster.
And yes, I might have daily-monday / daily-tuesday... and so on, and it would work better, but this way works for me. There is a lot of room for improvement, and it is not hard to implement with cron jobs and more buckets.
Let's say your data loss occurs on Sunday the 31st of a month at 23:55 and then gets synced across all your S3 backups (or it could occur at some point before that but you don't notice it). And poof goes your data.
THIS. I'd say at least half of the time I went to retrieve a good revision of a file from backup, it was already too late and the backup was trash as well.
You could try adding objects with the date in their names to a 'backup' bucket rather than create separate buckets for each backup. S3 also supports object expiration, so you can set your daily backups to expire after 7 days and weekly backups to expire after 4 weeks, for example.
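For the sake of illustration, the dated-key layout plus those expiration rules might look something like this with boto3 (bucket name, prefixes, and retention periods are all hypothetical examples):

    # Sketch: one bucket, dated keys, and lifecycle rules that expire old backups.
    import datetime
    import boto3

    s3 = boto3.client("s3")
    bucket = "my-backup-bucket"  # placeholder

    # Upload today's backup under a dated key, e.g. daily/2012-08-21.tar.gz
    today = datetime.date.today().isoformat()
    s3.upload_file("backup.tar.gz", bucket, f"daily/{today}.tar.gz")

    # Expire dailies after 7 days and weeklies after 28 days.
    s3.put_bucket_lifecycle_configuration(
        Bucket=bucket,
        LifecycleConfiguration={
            "Rules": [
                {"ID": "expire-dailies", "Filter": {"Prefix": "daily/"},
                 "Status": "Enabled", "Expiration": {"Days": 7}},
                {"ID": "expire-weeklies", "Filter": {"Prefix": "weekly/"},
                 "Status": "Enabled", "Expiration": {"Days": 28}},
            ]
        },
    )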
Though you make a good point, your comment could have avoided sounding pretentious if you shortened it by removing the first few words.
Also, please don't use uppercase for emphasis, as mentioned in the guidelines [1]. If you want to emphasize a word or phrase, put asterisks around it and it will get italicized.
Alright, I've edited it, thanks for the heads up. Dropbox and backup in one phrase tends to awaken my temper, as I've had some horrible near data losses (as in, I've had to restore data from the Dropbox cache folder, which could basically be purged at any point).
Yes, I'm aware of that. This helps, up to the point where you don't notice that something's gone for 'a few days'. Let's say you have your student project's folder shared with two other colleagues. You take a few days off and use your PC for casual browsing while your colleagues are working on the project. At the end, one of them (who doesn't quite understand how Dropbox works) deletes the files while the other is still working on them and Dropbox gets confused. Your PC gets synced at a moment when you don't notice. Come back from your holidays and you will have a nice surprise waiting.
It doesn't matter whether he notices it or not, you can't count on social factors when it comes to backup systems. He might notice it but then he thinks, oh no problem, it's all in Dropbox anyway, I'll talk to Dylan16807 when he gets back.
I've never had any complete data losses until now, but I've had this situation where the Dropbox cache was the only place to retrieve files twice and it showed me that it can't be trusted this way. I'm certainly not gonna wait until the real deal happens just to have a personal story on how Dropbox can go horribly wrong. I've been shown the possibility and that's enough for me.
As for specifics, the case I mentioned is the one where I saw it go wrong. IMO the only really safe way to use Dropbox without additional backups is if you never share folders and only ever use one system with write access at a time - which is not the usual use case for the product.
I just want to understand better why you had to resort to the cache. The only time I've ever done that was when I accidentally deleted a bunch of data and didn't want to restore each file by hand or put in a support ticket.
I can't quite put my finger on what the scenario was exactly. One time might have been a case where a folder with lots of small files got moved away and Dropbox only allows single-file restore, which would have taken hours, possibly longer than the lost work. Once I did something stupid where I wanted to exchange a whole Dropbox folder with a different version: I switched off the client on one PC (by mistake, obviously) and exchanged the folder on the other, then switched it on again. Then I became aware that the old folder had important stuff in it. The versioning was corrupted at that point; many old files would either not show up online or would produce an error message.
So usually there is some user behaviour involved, yes. But the whole point of a backup system is that you can rely on it, even when the user behaves stupidly up to a certain degree. Sync is for day to day collaboration and data management, it's not for backup.
If dropbox works well for you for source control, that is great. But frankly, if you get to the point where you have to start writing hacks to keep it working, it is probably time to move onto something designed for the task.
>A system that synchronizes live should never get that role, except if they can guarantee that old data is never overwritten, new is always appended. //
Why doesn't Dropbox enable this use case? It would surely be very easy to implement. I guess Packrat does do this in a way, but it seems like overkill.
I'd like something along the lines of duplicating all files but requiring confirmation of deletions and overwrites.
I currently just back up my digital photos (~20GB) to S3 via replication from my QNAP NAS... works out to about $3 per month... I'll probably use the option they mention here to auto-move content from S3 to Glacier, coming soon...
"In the coming months, Amazon Simple Storage Service (Amazon S3) plans to introduce an option that will allow you to seamlessly move data between Amazon S3 and Amazon Glacier using data lifecycle policies."
They store the last 30 days worth of versions of every file you modify. Dropbox could keep versions from the last 10 days in S3 but move the rest to Glacier. Restores for older versions wouldn't be instant, but if S3 storage is a nontrivial expense for them moving the bulk of previous versions to Glacier would cut down on costs.
Deleting data from Amazon Glacier is free if the archive being deleted has been stored for three months or longer. If an archive is deleted within three months of being uploaded, you will be charged an early deletion fee. In the US East (Northern Virginia) Region, you would be charged a prorated early deletion fee of $0.03 per gigabyte deleted within three months.
So I guess that means it would still work well for a scheme like time machine uses, where incremental changes are added but deletions are simply made note of. At least I think that's how it works.
Storage experts: I'd love to know more about what might be backing this service.
What kind of system has Amazon most likely built that takes 3-4 hours to perform retrieval? What are some examples of similar systems, and where are they installed?
There'll be a near-line HDD array. This is for recent content and content they profile as commonly accessed.
Then there'll be a robotic tape library. Any restore request will go into a queue, and when an arm/tape drive becomes free they'll seek to the data and read it into the HDD array.
Waiting for a slot with the robot arm / tape drive is what will take 4 hours.
Close. Tiered, yes. But remember who we're talking about.
First, no tape. The areal storage density of tape is lower than hard disks. Too many moving parts involved. Too hard to perform integrity checks on in a scalable, automated fashion without impacting incoming work.
Second, in order to claim the durability that they do (99.999999999%), that means every spot along the pipe needs to meet those requirements. That means the "near-line HDD array" for warm, incoming data needs to meet those requirements. Additionally, if the customer has specified that the data be encrypted, it needs to be encrypted during this staging period as well. It also needs to be able to scale to tens if not hundreds of thousands of concurrent requests per second (though, for something like Glacier, this might be overkill).
They've already built something that does all that. It's called S3. The upload operations likely proxy to S3 internally (with a bit of magic), and use that as staging space.
Wouldn't there also need to be a lot of logic to prevent fragmentation? You'd probably want data from one user near other data from that user, i.e. on the same tape.
I'd guess that they ignore that problem and have baked the time it takes to get data from several tapes into the 3-4 hour estimate.
If you think about it, writes are more common than reads on average, so it's more efficient to just write to whatever tape is online and deal with the fragmentation problem on the read end, as opposed to queueing writes until the 'correct' tape can be brought online just to save some time reading. Also, in backup situations like this, it's more important to get the backup done in a timely manner.
The multiple-hour window could give you a lot of wiggle room here though. It's unlikely to take 3 hours to restore from a single tape, so even if you have to visit 2-3 tapes then you have plenty of time.
I'm sure that there is a general tiered storage platform (as mentioned above) which keeps some of the data online as well. That would let you run a "defrag" algorithm later if you find you need it.
It could be, for example, a tape robot where you can have a huge number of tapes in storage but only a few devices for reading/writing them. With tape you can't really stream the data to the web; instead you would probably first copy it somewhere. If there is a lot of data, say a few terabytes, even this process takes some time.
Or, in case they are using regular hard drives, you might want this kind of time limit in order to pool requests going to a specific set of drives. This would enable them to power down the drives for longer periods of time.
The 3-4 hour estimate may also be artificial. Even if you can in most cases retrieve the data faster, it would be good to give an estimate you can always meet. They might also want to differentiate this more clearly from standard S3.
And we should not forget that it does take time to transfer, for example, one terabyte of data over the network.
Take a look at Linear Tape File System (LTFS), which allows for ad-hoc file retrieval. CrossRoads Systems out of Austin has the leading implementation, and they did bid on this Amazon contract. I have no idea if they won it, though (next corp conference call is Aug 29th). They just closed a joint investment with Iron Mountain, so my money is on an LTFS solution with CrossRoads as a systems vendor.
"Paying $12 to store a gigabyte of data for 100 years"
I'm not sure what kind of organisation I'd actually trust to store data for that length of time - a commercial organisation is probably going to be more effective at providing service but what commercial organisation would you trust to provide consistent service for 100 years? A Swiss bank perhaps? Governments of stable countries are obviously capable of this (clearly they store data for much longer times) but aren't set up to provide customer service.
The Royal Mint has existed for 1,100 years. That's pretty much the most stable government-owned business-like entity I can find.
The Stora Kopparberg mining company has existed since it was granted a charter from King Magnus IV in 1347.
A few banks tend to last for a long time [1]. Banca Monte dei Paschi di Siena has existed for about 540 years.
Beretta, the italian firearms company, has existed for 486 years (and has been family owned the entire time).
East and West Jersey were owned by a land proprietorship for around 340 years starting from King Charles II bestowing the land to his brother James in 1664. [2]
At first I thought multinational corporations would be more stable because they could move from land to land to avoid wars and such. But apparently they haven't lasted nearly as long as their single-nation counterparts.
The Knights Templar were granted a multi-national tax exemption by Pope Innocent II in 1139, and lasted almost 200 years until most of their leadership was killed off in 1307.
The Dutch East India Trading Company was one of the first [modern] multinational corporations, spanning almost 200 years from 1602-1798.
However, the longest-lasting companies have been family owned and operated. [3] [4]
It appears almost all companies that have lasted a long time owe it to two factors: dealing in basic goods and services that all humans need, and looking ahead to the future to change with the times.
> Governments of stable countries are obviously capable of this
I don't consider that obvious. I live in Berlin, the capital of what most would consider a stable country, but my apartment (which is even older) has been part of 5 different countries in the last 100 years (the German Empire, the Weimar Republic, Nazi Germany, East Germany and finally, the Federal Republic of Germany).
Sorry, what I meant by "stable" there is countries that have been relatively stable for a few hundred years and seem reasonably likely to continue that integrity for at least a century or so.
Of course, predicting future stability is complete guesswork!
Even the US had a close call with a fairly nasty civil war in that time frame, and in three days the 198th anniversary of the Burning of Washington happens: http://en.wikipedia.org/wiki/Burning_of_Washington "On August 24, 1814, after defeating the Americans at the Battle of Bladensburg, a British force led by Major General Robert Ross occupied Washington, D.C. and set fire to many public buildings. The facilities of the U.S. government, including the White House and U.S. Capitol, were largely destroyed."
The Thai king is the longest-reigning current head of state, ascending the throne on 9 June 1946. Elizabeth II of England is 2nd, 6 February 1952.
The oldest country (not government) is likely Vietnam (2897 BCE). Other contenders: Japan (660 BCE), China (221 BCE), Ethiopia (~800 BCE), or Iran (678 BCE).
Few of today's modern states pre-date the 19th century, and many date only from World War II or the great de-colonialisation of the 1960s, including much of Africa and Oceania (some of the longest inhabited regions of Earth).
Among the more long-lived institutions are the Catholic Church (traditionally founded by Jesus ~30 AD, emerging as an institutional power in 2nd Century Rome). The oldest company I can find is Kongo Gumi, founded in 578, a Japanese construction firm. The record however is likely held by the Shishi Middle School founded in China between 143 and 141 BCE.
My own suggestion would be the Krell, though some might disqualify this based on a requirement for human organization.
That was my thought as well. However it spent a great deal of time under foreign rule: under the Greeks and Romans, the Turkish / Ottoman empire, and later under British occupation. And, I just discovered what boxer Muhammad Ali's referent was.
We are straying a bit off topic here. I don't think any country has been stable for a "few" hundred years. A few is 3, and 300 years ago is before the US was founded.
I would say a stable country is one which has had a legitimate democracy for 70 or so years and doesn't share a border with a non-democratic / non-legitimately-democratic state. These two points suggest it's unlikely to have a revolution or be invaded any time soon.
Of course, if you look at the UK government's track record with IT... you wouldn't trust it. I would say the same of the US, especially with regard to data security.
I think I could argue that England has been pretty stable since at least the end of the Civil War - which is 360 years ago. A lot of the institutions that form part of the current UK go back an awful lot further than that.
I'll reject your democratic requirement out of hand, as democracies haven't proven particularly stable. There was that trial run in Athens, which lasted 501 years (508 - 7 BCE, with interruptions). Other than a few small/outlying instances (most notably the Althing in Iceland), it didn't re-emerge until the short-lived Corsican Republic (1755) and, of course, the United States (1776).
Japan (660 BCE) and China (221 BCE) have both had feudal / bureaucratic governments exhibiting very high levels of stability. While dynasties and eras are marked, the overall states persisted largely intact.
No offense, but Germany is far from what I'd call a stable nation. It has, in fact, only been a nation for 141 years, and during that time has, as you've noted, changed governments many times.
There are some pretty old commercial organisations out there. One list [1] has several that have existed for over 1000 years. Whether they would be capable of storing data (or would even be interested in doing so) is hard to say, but they have at least demonstrated that long-term organisational continuity is possible, and that presumably requires some organisational 'memory'.
Having a long history obviously isn't a predictor of future stability. According to the Long Now Foundation site [2], a Japanese company that existed since 578CE went bust in 2007.
But, it has not been as stable as you might think. During the 1300s, the popes resided in France.
As recently as 1870, during the Italian unification, the church was stripped of its power to govern Rome after an armed confrontation between armies at the gates of Rome ("XX Septembre"). During this period, the Pope (the ninth Pius) seriously considered fleeing Italy.
If you really want the organization to survive for a long term, you need to make it a religion.
If you can figure out a way to convince a few dozen people every decade that the best way to glorify God is to isolate themselves off somewhere maintaining your archival data, you'll be set for centuries.
Yeah, I'm kind of wondering the same thing. It's certainly the kind of timeframe that changes your perspective. Maybe a tiny bit of Danny Hillis rubbed off on me from working at Applied Minds (man, I sure hope so!)
Because as we answer issues of cost and availability, a logical thing to wonder is "how long can I really depend on it though?" As quickly as cloud services (where "lifetimes" are measured at six years) have entered our economy, that's a question begging to be answered.
Amazon at least seems to be an "eventually durable" datastore, though. Meaning that if you are told in the future that it will go offline, you have an excellent chance to make other suitable arrangements. Say there's a 0.01% chance of this product being discontinued next year, up to a 10% chance 5 years from now. I have to think there will almost certainly be other services you can move your data to, on similar terms, for a long time.
That's assuming you're around at all, and nothing reeeeeally bad happens even so. Making data last after your death (or even after you stop paying!) is a lot harder in this environment, and achieving true 100-year durability is a tough nut indeed.
I like your bank idea, since the preservation of a bank account is just a specialized simple case of data preservation. Data preservation seems rather more reliable when the data is directly attached to money. Then again, maybe banks themselves are on their way out for this purpose — Dropbox could become the new safety-deposit box.
But there are 100-year domain registrations, after all. Maybe we're ready for organizations to at least offer 100-year storage, too.
And, since you mentioned Danny Hillis, it's also worth mentioning that (ironically?) Jeff Bezos is one of the principal supporters of http://longnow.org/clock/.
And that makes me think of Anathem and the potential issues around long-term data storage that is capable of surviving through falls of civilisations and/or sacks of storage areas.
Probably what would be required is an array of arrays of separate storage providers, with services providing "RAID" on top of these storage providers - and since you won't want to trust any one of them, you'll want a few of them... (hence the array of arrays).
leastauthority.com (the Tahoe LAFS folks) are trying to promote the concept of "RAIC" (C=Cloud). I'm not sure what the status of the project is, though...
FWIW, we are happy to support this kind of use, and see our customers doing it in an ad-hoc way every day. We have s3cmd in our environment, and support it. As soon as there is a Glacier complement to s3cmd we will put that into place as well, although with the strange traffic and retrieval pricing, I'm not sure how useful folks will find it ...
If you really wanted to preserve something, I'd consider printing it on gold leaf. One gram of gold turns into about 4 sq ft, and assuming we can print at 1200dpi, we'd have 4 x 144 x 1200 x 1200 bits, or about 100 megabytes. So you'd need about ten grams, at a price of $530 or so, plus storage. Though you could just bury it in your backyard.
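The arithmetic, spelled out (same assumptions as above: roughly 4 sq ft of leaf per gram, 1200 dpi, one bit per dot):

    # Back-of-the-envelope for the gold-leaf idea above.
    # Assumes ~4 square feet of leaf per gram and one bit per dot at 1200 dpi.
    sq_in_per_gram = 4 * 144          # 4 sq ft -> square inches
    bits_per_sq_in = 1200 * 1200      # one bit per dot
    bits_per_gram  = sq_in_per_gram * bits_per_sq_in

    mb_per_gram = bits_per_gram / 8 / 1e6
    print(f"{mb_per_gram:.0f} MB per gram")      # ~104 MB
    print(f"{1000 / mb_per_gram:.1f} g per GB")  # ~9.6 g, i.e. roughly the ten grams quoted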
> Paying $12 to store a gigabyte of data for 100 years seems like a pretty intriguing deal as we emerge from an era of bit rot.
As long as that data is decode-able and more importantly, find-able (out of all the GBs frozen for 100 years, why would you want to look at any particular one of them?).
As long as that data is decode-able and more importantly, find-able (out of all the GBs frozen for 100 years, why would you want to look at any particular one of them?)
I'd store my pictures there. Finding old pictures of grandparents when they were little, or even older stuff, is amazing. Wouldn't it be cool if my descendants could still look at pictures of my family in 100 years?
Provided that downloading from this 100 year store is something I could do X times per year, and so long as I could append more data to it over time, it's an interesting business model.
To be fair, if improvements in hardware and software continue at the rate they have been, or some moderate percentage thereof, in 100 years it will be no problem to trawl a few exabytes of data for anything interesting.
There's a blog that's analyzing Geocities, that's about 1 terabyte of 1 KB files. http://contemporary-home-computing.org/1tb/ The analysis tracks changes in template design, follows modifications to logos and gifs, and unearths collections of shrines to dead children etc.
But that's from when it was harder to make and upload data, so people only put meaningful (to them) stuff online. These days we'd have a hundred thousand copies of a few popular MP3's and everyone's crappy digital photos. The percentage of meaningful stuff would be a lot lower.
This seems like a very interesting business idea. It'd require some level of initial operating capital and a relatively competent server farm team, but I don't think it'd have to be fancy.
"Long Data, LLC... We secure your data for the long-term".
What I find fascinating about Amazon's infrastructure push is the successful 'homonymization' of their brand name Amazon.
Amazon simultaneously stands for e-commerce and web infrastructure depending on the context. E.g., "Hey, I want to host my server." "Why don't you try Amazon?" "Do you know where I can get a fairly priced laptop?" "Check Amazon."
Is there any other brand that has done this successfully?
Virgin (200+ businesses operating, or having operated, under the Virgin brand, ranging from infrastructure - trains, airlines - to records, banking, and bridal salons under the name Virgin Bride...).
Mitsubishi and Samsung spring to mind as two of the best-known ones internationally, with brands recognized across multiple markets, though many of their businesses are less known outside Asia (e.g. Mitsubishi's bank is Japan's largest). Any number of other Asian conglomerates qualify too.
ITT used to fall in that category back in the day: fridges, PCs, hotels, insurance, schools, telecoms and lots more. I remember we at one point had both an ITT PC and an ITT fridge. The name was well known in many of its markets.
Large, sprawling, unfocused conglomerates have fallen a bit out of favor in Europe and the US. ITT was often criticized for its lack of focus even back in the '80s, and has since broken itself into more and more pieces and renamed and/or sold off many of them (e.g. the hotel group is now owned by Starwood).
Musical instruments (especially wind instruments) and motorcycles share a lot of similar design principles. Ever compared a saxophone to a WWII-vintage motorcycle engine?
In India, the Tata and Birla groups come to mind. The Tatas[1] particularly so. Both are family owned conglomerates and have been around before India gained Independence from the British.
The Japanese Keiretsu[2], from what I have read, is similar. The companies are somewhat loosely connected, but are connected nevertheless.
This actually confused me when I saw the headline. I saw 'Amazon' and, despite having been working with AWS all night, thought ecommerce first and couldn't figure out what they could name 'Glacier'. I was kind-of-not-really hoping it was going to be a new shipping option guaranteed to take a long time. That said, the actual service looks solid.
I know a lot of people seem to have jumped on the backup angle of Glacier here, and whilst there is some potential for home users to make use of this product for backup, that is not what Glacier is intended for.
Glacier is an archive product. It's for data you don't really see yourself ever needing to access in the general course of business ever again.
If you're a company and you have lots of invoice/purchase transactional information that's 2+ years old that you never use for anything, but you still have to keep it for 5 - 10 years for compliance reasons, Glacier is the perfect product for you.
Even its pricing is designed around the assumption that the average use case only accesses small portions of the total archive store (5% per month, prorated, is free according to the pricing page).
Many users, though, will never use the restore capability. And for those who do, with Backblaze they'll usually get a hard drive FedExed to them, so recovery time is measured in about a day or so. I wouldn't downplay the consumer backup/restore angle so quickly - for many (most?) consumers, the ability to restore rapidly is balanced by their desire to have low monthly payments. I think we're going to see a lot of consumer backup applications built on top of Glacier in the next several months that will be competing with (the already excellent) Backblaze and friends. (Note - Backblaze has excellent real-time restore, with date versioning, for those of us who use it as an online data recovery tool as well.)
This is fantastic. I've long searched for a solution like this. It's really suitable for a remote backup that only needs to be accessed if something really bad happens (e.g. a fire breaking out). I'm a lone entrepreneur, so I do have backup hard disks here, but being able to additionally save this data in the cloud is great.
I'm often creating pretty big media assets, so Dropbox doesn't necessarily offer enough space, or is - for me - too expensive in the 500GB version (i.e. $50 a month).
Glacier would be $10 a month for 1 terabyte. Fantastic.
> Glacier would be $10 a month for 1 terabyte. Fantastic.
+ the $120 or so per TB to transfer it outside of AWS if you need the whole thing back as fast as possible. Still likely to be very cheap as long as you treat it as a disaster recovery backup, though. Will definitively consider it.
(an alternative for you is a service like Crashplan, which also allows you very easy access to past file revisions via a java app and can be very cheap and also allow "peer to peer" backups with your friends/family; the downside with Crashplan is that it can be slow to complete a full initial backup to their servers or to get fully backed up again if you move large chunks of data around)
The only issue I see is that verifying archive integrity (you don't want to find out the archive was bad after you lost the local backup...) would be somewhat complicated, given their retrieval policies. Also, the billing for data-transfer out plus peak retrievals sounds so convoluted, I can't begin to work out what a regular test-restore procedure would cost me. Nevertheless, it's some exciting progress in remote storage!
They could provide salted hash verification: send some salt, get a list of files with SHA1(salt | filedata) via email some hours later (so they can do the verification as a low priority job).
The salt is used to prevent amazon from just keeping the hashes around to report that all is well.
To avoid abuse, restrict the number of free verification requests per month.
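The client side of that check is trivial; a sketch of what it might look like (file names here are examples, and this assumes the service reports SHA1(salt || contents) per file for the salt you sent):

    # Sketch of the client side of the salted-hash verification described above:
    # compute SHA1(salt || file contents) for the local copies and diff the result
    # against the list the service would send back for that salt.
    import hashlib

    def salted_sha1(salt: bytes, path: str, chunk_size: int = 1 << 20) -> str:
        h = hashlib.sha1(salt)
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        return h.hexdigest()

    salt = b"fresh-random-salt-per-request"  # new salt each request, so hashes can't be cached
    expected = {p: salted_sha1(salt, p) for p in ["photos.tar", "documents.tar"]}
    # ...then compare `expected` with the emailed report for the same salt.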
Or just build the verification into the storage system, and send a SNS message if data loss has occurred (just like what happens when a Reduced Redundancy object has been lost in S3).
Agreed - I'm really happy about this. I have a home NAS solution that is a few TB and it's too expensive to store on S3. This is perfect to prevent the "house burned down" scenario on very large storage devices!
I recently heard about a startup (spacemonkey) that will be offering 1TB with redundancy and no access delay for $10/month. The way they do it is really clever as well. Cloud storage has always seemed way too expensive to me, but these lower prices have me re-evaluating that.
I'm not sure the cost is right. Each project I work on is approx. 50-60 TB in size (video). The most recent one got backed up onto 20 LTO-5 tapes, times three. That's $600 for tapes per project. Each tape set went to a separate location - two secure ones for about $20/year and one at the studio archive for immediate access, if needed. I find this method extremely reliable; it cost ~$700 initially to back it all up, with virtually nonexistent further fees. With Glacier it would cost $600 per month.
You have a pretty niche (but interesting!) use case. An LTO-5 tape can store 1.5 terabytes raw [1] - call it 2 terabytes with a bit of compression (your video probably doesn't losslessly compress at 2:1). 60 terabytes requires 30 tapes - around $15/month to store at Iron Mountain [2]. The Glacier charge for 60 terabytes is $600/month vs $15/month for tape storage.
Also - upload/recovery times are problematic when you are talking 10s of terabytes. Right now, the equation is in favor of archiving tapes at that level (Even presuming you store multiple copies for redundancy/safety).
Glacier is for the people wanting to archive in the sub-ten terabyte range - they can avoid the hassle/cost of purchasing tape drives, tapes, software - and just have their online archive.
The needle will move - in 10 years Glacier might make sense for people wanting to store sub 100 Terabytes, and tapes will be for the multi-petabyte people.
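Putting those numbers side by side (all figures are the ones quoted in this subthread: ~2 TB per LTO-5 tape with light compression, ~$15/month at Iron Mountain for 30 tapes, $0.01/GB-month for Glacier):

    # Cost comparison using the figures quoted above.
    data_tb = 60
    glacier_per_gb_month = 0.01        # Glacier storage price quoted in the thread
    lto5_capacity_tb = 2               # ~1.5 TB raw, call it 2 TB with some compression
    iron_mountain_month_for_30 = 15    # ~$15/month quoted for 30 tapes

    tapes = data_tb / lto5_capacity_tb                        # 30 tapes
    glacier_monthly = data_tb * 1000 * glacier_per_gb_month   # $600/month
    tape_monthly = iron_mountain_month_for_30 * (tapes / 30)  # $15/month

    print(f"{tapes:.0f} tapes; Glacier ${glacier_monthly:.0f}/mo vs tape storage ${tape_monthly:.0f}/mo")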
Interestingly they penalise you for short-term storage:
Amazon Glacier is designed for use cases where data is retained for months, years, or decades. Deleting data from Amazon Glacier is free if the archive being deleted has been stored for three months or longer. If an archive is deleted within three months of being uploaded, you will be charged an early deletion fee. In the US East (Northern Virginia) Region, you would be charged a prorated early deletion fee of $0.03 per gigabyte deleted within three months
After reading tezza's explanation [1] of how they're probably using tape storage, this makes sense; Amazon wants the mechanical robot arm to spend the majority of its time writing to the tapes. If you're constantly tying it up with writes/deletes, you're taking time away from its primary mission: to archive your data. Charging you for early deletes discourages that practice.
There are any number of reasons why deletes would be discouraged. One is packing: if your objects are "tarred" together in a compiled object, discouraging early deletes makes it more cost-effective to optimistically pack early.
I had a quick skim through the marketing stuff and the FAQs and didn't see anywhere that actually details what the backend of this is. I'd be curious if they're actually using tape, older machines, Backblaze pods, etc. I guess if it's the latter, the time to recover could be an artificial barrier to prevent people from getting cute.
It appears to use S3 as its basic backend. My guess is that S3 has been modified to have "zones" of data storage that can be allocated for Glacier. Once these zones have been filled with data (and of course that data is replicated to another region) the hard drives are spun down and essentially turned off.
This is why the cost of retrieval is so high: every time they need to pull data the drives need to be spun back up (including drives holding data for people other than you), accessed, pulled from, then spun back down and put to sleep. Doing this frequently will put more wear and tear on the components and cost Amazon money in power utilization.
As is, Glacier should be extremely cheap for AWS to operate, regardless of the total amount of data stored with it. Beyond the initial cost of purchasing, installing, and configuring the hard drives, the usual ongoing maintenance and power requirements go away.
Someone further up mentioned a very plausible (in my experience) answer. Magnetic tape, using hard drive arrays as RAM. The wait time in this situation would be the time needed to complete all the current tasks waiting to be written/read in the queue before your data is written from tape to hard drive so you can access it.
The retrieval fee for 3TB could be as high as $22,082 based on my reading of their FAQ [1].
It's not clear to me how they calculate the hourly retrieval rate. Is it based on how fast you download the data once it's available, how much data you request divided by how long it takes them to retrieve it (3.5-4.5 hours), or the size of the archives you request for retrieval in a given hour?
This last case seems most plausible to me [6] -- that the retrieval rate is based solely on the rate of your requests.
In that case, the math would work as follows:
After uploading 3TB (3 * 2^40 bytes) as a single archive, your retrieval allowance would be 153.6 GB/mo (3TB * 5%), or 5.12 GB/day (3TB * 5% / 30). Assuming this one retrieval was the only retrieval of the day, and as it's a single archive you can't break it into smaller pieces, your billable peak hourly retrieval would be 3072 GB - 5.12 GB = 3066.88 GB.
Thus your retrieval fee would be 3066.88 * 720 * .01 = $22081.535 (719x your monthly storage fee).
That would be a wake-up call for someone just doing some testing.
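Spelling out that worst-case calculation (same assumptions as above: a single 3 TB archive requested at once, the 5% monthly allowance prorated daily, $0.01 per GB-hour applied across the 720 hours in a month):

    # Worst-case retrieval fee from the comment above; all assumptions as stated there.
    archive_gb = 3 * 1024                       # one 3 TB archive, in GB
    monthly_allowance_gb = archive_gb * 0.05    # 5% of stored data free per month
    daily_allowance_gb = monthly_allowance_gb / 30

    billable_peak_gb_per_hour = archive_gb - daily_allowance_gb   # 3066.88 GB
    fee = billable_peak_gb_per_hour * 0.01 * 720                  # $0.01/GB-hour over 720 hours

    print(f"Billable peak: {billable_peak_gb_per_hour:.2f} GB/hr")
    print(f"Retrieval fee: ${fee:,.2f}")        # ~$22,081.54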
[3] How do you think this interacts with AWS Export? It seems that AWS Export would maximize your financial pain by making retrieval requests at an extraordinarily fast rate.
[(edit) 4] Once you make a retrieval request the data is only available for 24 hours. So even in the best case, that they charge you based on how long it takes you to download it (and you're careful to throttle accurately), the charge would be $920 ($0.2995/GB) -- that's the lower bound here. Which is better, of course, but I wouldn't rely on it until they clarify how they calculate. My calculations above represent an upper bound ("as high as"). Also note that they charge separately for bandwidth out of AWS ($368.52 in this case).
[(edit) 5] Answering an objection below, I looked at the docs and it doesn't appear that you can make a ranged retrieval request. It appears you have to grab an entire archive at once. You can make a ranged GET request, but that only helps if they charge based on the download rate and not based on the request rate.
[(edit) 6] I think charging this way is more plausible because they incur their cost during the retrieval regardless of whether or how fast you download the result during the 24 hour period it's available to you (retrieval is the dominant expense, not internal network bandwidth). As for the other alternative, charging based on how long it takes them to retrieve it would seem odd as you have no control over that.
Former S3 employee here. I was on my way out of the company just after the storage engineering work was completed, before they had finalized the API design and pricing structure, so my POV may be slightly out of date, but I will say this: they're out to replace tape. No more custom build-outs with temperature-controlled rooms of tapes and robots and costly tech support.
If you're not an Iron Mountain customer, this product probably isn't for you. It wasn't built to back up your family photos and music collection.
Regarding other questions about transfer rates - using something like AWS Import/Export will have a limited impact. While the link between your device and the service will be much fatter, the reason Glacier is so cheap is because of the custom hardware. They've optimized for low-power, low-speed, which will lead to increased cost savings due to both energy savings and increased drive life. I'm not sure how much detail I can go into, but I will say that they've contracted a major hardware manufacturer to create custom low-RPM (and therefore low-power) hard drives that can programmatically be spun down. These custom HDs are put in custom racks with custom logic boards all designed to be very low-power. The upper limit of how much I/O they can perform is surprisingly low - only so many drives can be spun up to full speed on a given rack. I'm not sure how they stripe their data, so the perceived throughput may be higher based on parallel retrievals across racks, but if they're using the same erasure coding strategy that S3 uses, and writing those fragments sequentially, it doesn't matter - you'll still have to wait for the last usable fragment to be read.
I think this will be a definite game-changer for enterprise customers. Hopefully the rest of us will benefit indirectly - as large S3 customers move archival data to Glacier, S3 costs could go down.
I wasn't holding my breath, but I was thinking there was a possibility they were using short-stroking to speed up the hard drives in most of their systems, carving out a quarantined, rarely touched Glacier zone on the inner portion of each drive:
https://plus.google.com/113218107235105855584/posts/Lck3MX2G...
The Marvell ARM chipsets at least have SATA built in, but I'm not sure if you can keep chaining out port expanders ad-infinitum the same way you can USB. ;)
Thanks so much for your words. I'm nearly certain the custom logic boards you mention are done with far more vision, panache, and big-scale bottom-line foresight than these ideas, but even some CPLD multiplexers hot-swapping drives would be a sizable power win over SATA port expanders and USB hubs. Check out the port expanders on the Open Compute Vault 1.0, and their burly aluminium heat sinks:
https://www.facebook.com/photo.php?fbid=10151285070574606...
That would definitely be cool. Pretty unlikely, however. When it comes to hardware, they like to keep each service's resources separate. While a given box or rack may handle many internal services, they're usually centered around a particular public service. S3 has their racks, EC2 has theirs, etc. Beyond the obvious benefit of determinism - knowing S3 traffic won't impact Glacier's hardware life, being able to plan for peak for a given service, etc. - I'm guessing there are also internal business reasons. Keeping each service's resources separate allows them to audit costs from both internal and external customers.
Then there's failure conditions. EBS is an S3 customer. Glacier is an S3 customer. Some amount of isolation is desirable. If a bad code checkin from an S3 engineer causes a systemic error that takes down a DC, it would be nice if only S3 were impacted.
I probably shouldn't go into the hardware design (because 1) I'm not an expert and 2) I don't think they've given any public talks on it), but it's some of the cooler stuff I've seen, especially when it came to temperature control.
The math doesn't come close to replacing tape - basically, once you go north of 100 terabytes (just two containers - at my prior company we had 140 containers in rotation with Iron Mountain), Glacier doesn't make financial or logistical sense. It's far cheaper and faster to ship your LTO-5 tapes by courier.
It may not make sense today. Amazon is notorious for betting on the far future. They're also raising the bar on what archival data storage services could offer. When you ship your bits to Amazon, they're in 3+ DCs, and available programmatically.
Separate from the play for replacing tape, there's also the ecosystem strategy. When you run large portions of your business using Amazon's services, you tend to generate a lot of data that ends up needing to be purged, else your storage bill goes through the roof. S3's Lifecycle Policy feature is a hint at the direction they want you to go - keep your data, just put it somewhere cheaper.
This could also be the case where they think they're going after tape, but end up filling some other, unforeseen need. S3 itself was originally designed as an internal service for saving and retrieving software configuration files. They thought it would be a wonder if they managed to store more than a few GB of data. Now look at it. They're handling 500k+ requests per second, and you can, at your leisure, upload a 5 TB object, no prob.
But maybe you're right. The thing could fail. Too expensive. After all, 640K ought to be enough for anybody.
Thanks very much for the insight - what you are saying actually makes a lot of sense in the context of systems inside the AWS ecosystem. After all, they need to archive data as well. Also - my 140-container example w/Iron Mountain was pre-versioning and pre-always-online differential backups. We basically had a complex Tower of Hanoi rotation that let us recover data from a week, a month, six months, and then every year (going back seven years) from all of our servers. (And, by year seven, when we started rotating some of the old tapes back in, they were a generation older than any of our existing tape drives. :-)
Clearly, with on-line differential backups - you might be able to do things more intelligently.
I'm already looking forward to using Glacier, but, for the foreseeable future, it looks like the "High End" archiving will be owned by Tape. And, just as Glacier will (eventually) make sense for >100 Terabyte Archives, I suspect Tape Density will increase, and then "High End" archiving will be measured in Petabytes.
Have you considered the cost of the tape loaders? Our loaders cost significantly more over their lifetime than the storage costs of the tapes themselves.
The tradeoffs will be different depending on how many tapes you write and how often you reuse them.
Until I took over backups, and instituted a rotation methodology, the guy prior to me just bought another 60 AIT-3 tapes every month and shipped them off site to Iron Mountain.
Agreed - how often you re-use tapes (and whether you do at all) has a dramatic effect on the "system cost" of your backup system.
"It wasn't built to back up your family photos and music collection."
But at its price points, with most US families living under pretty nasty data cap or overage regimes, it sounds superb, with of course the appropriate front ends.
There's no good (reliable), easy, and cheap way to store digital movies; e.g., recordable DVD media is small by today's standards and much worse than CD-R for data retention (I haven't been following recordable Blu-ray media, I must confess - I bought an LTO drive instead, but I'm of course unusual). And the last time I checked very few people made a point of buying the most reliable media of any of these formats.
In the case of disk failure, fire, or tornado (http://www.ancell-ent.com/1715_Rex_Ave_127B_Joplin/images/ ... and rsync.net helped save the day), you don't care about quick recovery so much as knowing your data is safe (hopefully AWS has been careful enough about common-mode failures) and knowing you can eventually get it all back. Plus a clever front end will allow for some prioritizing.
An important rule learned from Clayton Christensen's study of disruptive innovations (where the hardest data comes from the history of disk drives...) is that you, or rather AWS here, can't predict how your stuff will be used. So if they're pricing it according to their costs, as you imply, they're doing the right thing. Me, I've got a few thousand Taiyo Yuden CD-Rs whose data is probably going to find a second home on Glacier.
ADDED: Normal CDs can rot, and getting them replaced after a disaster is a colossal pain even if your insurance company is the best in the US (USAA ... and I'm speaking from experience, with a 400+ line item claim that could have been 10 times as bad, since most of my media losses were due to limited water problems), so this is also a good solution for backing them up. Will have to think about DVDs....
Very possibly, but who knows; per the above on disruptive innovations, Amazon almost certainly doesn't.
I personally don't have a feel for enterprise archival requirements (vs. backups), but I do know there are a whole lot of grandparents out there with indifferently stored digital media of their grand-kids (I know two in particular :-); the right middlemen plus a perception of enough permanent losses of the irreplaceable "precious moments" and AWS might see some serious business from this in the long term.
This was interesting to wake up to this morning ...
Right now we sell 10TB blocks for $9500/year[1].
This works out to 7.9 cents/GB per month, so 7.9x the Glacier pricing. However, our pricing model is much simpler, as there is no charge at all for bandwidth/transfer/usage/"gets" - the 7.9 cents is it.
7.9x is a big multiplier. OTOH, users of these 10TB blocks get two free instances of physical data delivery (mailing disks, etc.) per year, as well as honest-to-god 24/7 hotline support. And there's no integration required - it just works right out of the box on any Unix-based system.
We had this same kind of morning a few years ago when S3 was first announced, and always kind of worried about the "gdrive" rumors that circulated on and off for 4 years there...
I've spent several hours reading about this and talking with colleagues, reading the (really great) HN threads on the topic and doing a bunch of math - and I've come to the conclusion that rsync.net/backblaze/tarsnap/crashplan probably don't have too much to worry about for _most_ use cases.
The wonky pricing on retrieval makes this inordinately complex to price out for the average consumer who will be doing restores of large amounts of data.
The lack of easy consumer flexibility for restores also is problematic for the use case of "Help, I've lost my 150 GB Aperture Library / 1 TB Hard Drive"
The 4-hour retrieval time makes it a non-starter for those of us who frequently recover files (sometimes from a different machine) via the website.
The cost is too much for >50 Terabyte Archives - those users will likely be doing multi-site Iron Mountain backups on LTO-5 tapes. After 100 terabytes, the cost of the drives is quickly amortized and ROI on the tapes is measured in about a month.
The new business model that Amazon may have created overnight, though, and where they beat everyone on price and convenience, is "off-site archiving of low-volume, low-value documents" - think family pictures. Your average shutterbug probably has on the order of 50 GBytes of photos (give or take) - is it worth $6/year for them to keep a safe offline archive? Every single one of those people should be signing up for the first software package that gives them a nice consumer-friendly GUI to back up their Picasa/iPhoto/Aperture/Lightroom photo library.
Let's all learn a lesson from Mat (one "t") Honan.
Online backup for my photos and other data was my initial thought, but I'm afraid it would cost too much to do a restore: if I store 3 TB of photos/documents/etc. for 2 years, then have a house fire (local backup destroyed), I want to be able to restore my data to my new computer as quickly as my Internet connection will let me, and I don't want to be stuck with a huge bill for retrieval on top of all the other expenses relevant to such a disaster. AWS should make the monthly retrieval allowance roll over and accumulate from month to month, so that I can do occasional large retrievals.
re: "I want to be able to restore my data to my new computer as quickly as my Internet connection will let me"
Really? Why? If you have, say, 10 years of home pictures/movies, and you know they are 100% safe in Amazon Glacier, why do you need them all on your new computer as fast as possible? I don't understand why it's such a rush.
If it's a rush, you pay the fee. If you can afford to wait a month or two or three to get all the data back for free, you trickle your pics/movies back to your new computer one day at a time.
It seems Amazon charges by the peak hour, so if you can throttle your retrieval so that it takes 3 or 4 days to get the data back, the fee would be a lot less.
A 5 GB per hour download would cost about $36 for the month, and at that rate you could pull your entire 3TB back down in under a month. So I don't think that's a crazy fee when your computer was destroyed by a fire...
To get your data back in a week requires about 18 GB per hour, which is roughly $130. Not unreasonable either, considering the urgency and the circumstances.
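To make that trade-off concrete, here's a rough sketch (mine, and it assumes billing is based on your actual peak download hour, which is exactly the open question in this thread; it also ignores the separate AWS transfer-out charges):

    # Fee for retrieving total_gb spread perfectly evenly over `days`,
    # assuming the billable peak is your actual peak download hour.
    def throttled_fee(total_gb, days, stored_gb=None, rate_per_gb=0.01):
        stored_gb = total_gb if stored_gb is None else stored_gb
        hourly = total_gb / (days * 24)             # even retrieval rate
        hourly_free = (stored_gb * 0.05 / 30) / 24  # even retrieval -> 1/24 of daily allowance
        return max(hourly - hourly_free, 0) * 720 * rate_per_gb

    print(throttled_fee(3 * 1024, days=30))   # ~$29 over a month (the $36 above assumes 5 GB/hr)
    print(throttled_fee(3 * 1024, days=7))    # ~$130 over a week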
Agreed. Let me add that a lot of us are living under severe data cap or overage regimes, for me and my parents it's 2 AT&T plain DSL lines, each with 150 GB/month free, go over that 2-3 times and you start paying $10/50 GB/month on a line.
So uploading as well as downloading would have to be throttled. But this sounds like a superb way to store all those pictures and movies of the grandchildren, especially for those who don't have a son with an LTO drive ^_^. All the other alternatives are lousy or a lot more expensive.
I do my online backups to Backblaze. $3.69/month, and it backs up both my internal 256 GB SSD and the external (portable) 1TB HD that I keep all my Aperture "Referenced Masters" on.
A full document restore isn't done over the Internet; I have Backblaze FedEx me a USB hard drive - though, unless something has gone really, really wrong (building burned down?), I also have a SuperDuper image of my hard drive that's at most a week or so old.
My Use Case for Glacier is Dropping a 10-20 year archive every 5 years. 50 Gigabytes of data will cost me $120 to leave there for the next 20 years. I can make good use of that.
For an allegedly "simple" archival service, that's a bizarre pricing scheme that will be hard to code around. If you wrote an automated script to safely pull a full archive, a simple coding mistake, pulling all data at once, would lead you to be charged up to 720 times what you should be charged!
First, the reason the "peak hourly retrieval rate" of "1 gigabyte per hour" is there in the article is to answer this question. With a daily allowance of 5.12 GB and a 1 GB/hour transfer rate, that gives you a "peak hourly retrieval" of 0.79 GB (at 5.12/24, your first 0.21 GB of that hour is free), and so we multiply:
0.79 * 720 * 0.01
Giving me a little less than $6.
Now, do you think Amazon is likely to think they can get away with selling a service that charges you $22k for a 3TB retrieval?
Second, you have ranged GETs and tar headers; use them to avoid transferring all of your data out of the system at once. [Edit: looks like ranged GETs are on job data, not on archive retrieval itself. My bad.]
10Gbps EC2 instances start at $0.742/hour. Welcome to the cloud. ;-)
I assume the cost is in retrieval though and counted per the Job Creation API, regardless of whether and how quickly you download the data.
But you're right that the 3TB/hour use case is very hypothetical. Internet archival is just not suitable for those kinds of volumes. I think the point the OP was making is that mistakes like using archives that are too large, or requesting many at once, could cost you a lot.
If you actually USE 10Gbps, your data transfer bill is going to be around $167k per month (that's for transferring 3.34PB).
Actually, a bit higher than that since I calculated all based on the cheapest tier EC2 will quote on the web, 5 cents per gigabyte.
For a one time 3TB download to an EC2 instance, priced at the first pricing tier of $0.12/gigabyte, that transfer will cost $360, and take around 40 minutes.
Afford a 10Gbps connection? You can buy 1Gbps transit for under $1/Mbps, and much less at 10Gbps. So with a monthly bill of, say, $5K for the 10Gbps IP, $22K is not quite "chump change".
I think you're making an incorrect assumption about which is the most plausible method for calculating the hourly retrieval rate.
The most obvious way to me would be to assume it is based on the actual amount of data transferred in an hour less the free allowance they give you. Which is actually what they say:
"we determine the hour during those days in which you retrieved the most amount of data for the month."
This also ties in with what the cost is to them, the amount of bandwidth you're using.
In your example you would need to be getting transfer rates of 3TB/hr. Given the nature of the service I don't think they are offering that amount of bandwidth to begin with. (I'm sure they get good transfer rates to other amazon cloud services but customers could be downloading that data to a home PC at which point they will not be getting anything even close to those transfer rates)
At that point a bigger issue might be how long it takes to get the data out rather than the cost.
At an overly generous download speed (residential cable) of 10GB/hr your 3TB archive would take over 12 days to download.
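For what it's worth, the download-time side of that is simple arithmetic (my numbers):

    # Days needed to pull an archive down at a sustained rate.
    def days_to_download(total_gb, gb_per_hour):
        return total_gb / gb_per_hour / 24

    print(days_to_download(3 * 1024, 10))   # ~12.8 days at 10 GB/hr
    print(days_to_download(3 * 1024, 36))   # ~3.6 days at ~10 MB/s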
Given tc's edits above regarding additional charges for transferring out of AWS, I'm starting to change my mind. I still can't believe Amazon would ever end up charging north of $20k for a 3TB retrieval, but it seems the intended use case (as enforced by pricing) is write-once, read-never! Other use cases are possible, but as others have noted, you would want to be very careful how you go about setting things up to avoid some ugly charges.
Based on the ZDNet article linked elsewhere on the comments, this system does not use any tape at all. It is all commodity hardware and hard drives, pretty much in line with the design of the rest of the services from AWS.
Having a multi-hour delay in retrieval lets them move it into their off-peak hours. Since their bandwidth costs are probably calculated off their peak usage, a service that operates entirely in the shadow of that peak has little to no incremental cost to them.
I'm not talking load so much as network traffic. Companies like Amazon and Google have huge peak hour outbound traffic during US waking hours, and then a huge dip during off hours. If they can push more of the traffic into those off hours they can make the marginal cost of the bandwidth basically zero.
So if you make a request at peak hours (say 12 noon ET), they just make you wait until 11 PM ET to start downloading, shifting all that bandwidth off their peak.
Even if you're just "flattening" the peak, when it comes to both CPU and bandwidth, that's a major cost reduction since their cost is driven by peak usage and not average usage in most cases.
Amazon has ridiculous internal bandwidth. The costly bit is external. The time delay is largely internal buffer time - they need to pull your data out of Glacier (a somewhat slow process) and move it to staging storage. Their staging servers can handle the load, even at peak. GETs are super easy for them, and given that you'll be pulling down a multi-TB file via the Internet, your request will likely span multiple days anyhow - through multiple peaks/non-peaks.
I was referring to the external bandwidth. Even if pulling down a request takes hours, forcing them to start off peak will significantly shift the impact of the incremental demand. I'm guessing that most download requests won't be for your entire archive - someone might have multiple months of rolling backups on Glacier, but it's unlikely they'd ever retrieve more than one set at a time. And in some cases, you might only be retrieving the data for a single use or drive at a time, so it might be 1TB or less. A corporation with fiber could download that in a matter of hours or less.
I get it - but I'm arguing that the amount of egress traffic Glacier customers (in aggregate) are likely to drive is nothing in comparison to what S3 and/or EC2 already does (in aggregate). They'll likely contribute very little to a given region's overall peakiness.
That said - the idea is certainly sound. A friend and I had talked about ways to incentivize S3 customers to do their inbound and outbound data transfers off-peak (thereby flattening it). A very small percentage of the customers drive peak, and usually by doing something they could easily time-shift.
I think the point is that they're trying to average the load across the datacenter, only part of which is Glacier. If they can offset all the Glacier requests by 12 hours, they'll help normalize load when combined with non-Glacier activity.
An uneducated guess? Maybe there is some type of spare storage pool just for staging the Glacier restore requests, and they've done the math to figure out how much space they need on average over time for this. The 24-hour storage expiration probably helps with this: they've calculated how much space they need on hand for typical and spiky restore demand, and the restore delay gives them a window to move storage pools around on the backend if they need additional online capacity within the next X hours. Plus there could be limited bandwidth between the back-end archival arrays and the restore-pool hosts to save on cost, etc., which is also part of the pricing equation and the delay time.
Hopefully, they've also calculated the on-call response time for the tape operator making the drive in to work (or pausing the console game and walking over to the DC). Unless they've come a long way, robot/library drive mechanisms and belts often need adjustment. Besides, someone has to pick up the LTOs that slipped from the gripper.
According to this post[1] they charge based on how long it takes them to retrieve the data. The hourly retrieval rate would be the amount of data you requested divided by how long it takes them to retrieve it (3.5 - 4.5 hours).
If it takes them 4 hours to retrieve your 3TB, then your peak hourly retrieval rate would be 768GB / hour (3072 GB / 4 hours). Your billable hourly retrieval rate would be 768GB - 1.28GB (3072 * .05 / 30 / 4 hours).
Total retrieval fee: 766.72 * 720 * .01 = $5520.38 (~180x your monthly storage fee)
The pricing appears to not be optimized for retrieving all your data in one fell swoop. This particular example appears to be a worst case scenario for restoration because you haven't split up your data into multiple archives (doing so would allow you to reduce your peak hourly retrieval by spacing out your requests for each archive) and you want to restore all your data (the free 5% of your data stored doesn't help as much when you want to restore all your data).
A spokesperson for AWS confirmed this for me for an article [1] I wrote for Wired: "For a single request the billable peak rate is the size of the archive, divided by four hours, minus the pro-rated 5% free tier."
Good catch. In fact, it totally fits with the description of the service as store-and-forget for compliance, where retrieval requests touch only a small subset of the data (for example, when storing customer records).
I also must say that the way you calculate the retrieval fee really looks like black magic at first sight. I hope they will add a simple calculator to evaluate some scenarios, and publish the expected bandwidth available from Glacier to an EC2 instance.
'Update: An Amazon spokesperson says “For a single request the billable peak rate is the size of the archive, divided by four hours, minus the pro-rated 5% free tier.”'
This seems to imply the cost is closer to $5.5k than $22k. However, the spokesperson's statement seems to describe intended system performance, not prescribe the resulting price. So if it actually does take them an hour to retrieve your data, you might still owe them $22k.
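Taking the spokesperson's wording at face value, the single-request case works out as below (a sketch only; the four-hour window and the pro-rating method are as described in the comments above, and Amazon could meter it differently in practice):

    # Single request: billable peak = (archive size / 4 hours) minus the
    # pro-rated free tier. If this retrieval is the only activity that day,
    # the peak hour is 1/4 of the day's total, so 1/4 of the daily allowance is free.
    def single_request_fee(archive_gb, stored_gb=None, retrieval_hours=4, rate_per_gb=0.01):
        stored_gb = archive_gb if stored_gb is None else stored_gb
        peak_hourly = archive_gb / retrieval_hours
        hourly_free = (stored_gb * 0.05 / 30) / retrieval_hours
        return (peak_hourly - hourly_free) * 720 * rate_per_gb

    print(single_request_fee(3 * 1024))   # ~$5520.38 for a single 3 TiB archive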
I'm not seeing this at all for my use case. Unless I've figured it wrong, if I were to use this for an offsite backup of my photos, my ISP's Acceptable Use Policy limits my rate enough that I'm seeing only about a 10% penalty beyond normal transfer costs. See http://n.exts.ch/2012/08/aws_glacier_for_photo_backups for some sample "real-world" numbers.
3TB is a huge archive. I'm also not sure about your maths, the billable peak hourly chiefly. [ed: "·" is used as a multiplication sign below, for formatting]
Let's run 100GB, call it X. Allowance limit: 100GB · 5% is 5GB/mo, or per day, 100GB/(20 · 30) = 0.166GB/day; X/600.
Hourly rate necessary for a sustained 24-hour retrieval of the full 100GB: 100GB/24hr, or 4.166GB/hr; X/24. That is the peak hourly rate.
"To determine the amount of data you get for free, we look at the amount of data retrieved during your peak day and calculate the percentage of data that was retrieved during your peak hour. We then multiply that percentage by your free daily allowance."
To begin, all that's stated here is: spread your data retrieval out over the day. Their example:
"you retrieved 24 gigabytes during the day and 1 gigabyte at the peak hour, which is 1/24 or ~4% of your data during your peak hour."
We're doing 4.166GB in the peak hour out of 100GB retrieved that day, or ~4%.
(X/24) / X = 1/24 = ~4.1666%, as long as the metering comes out evenly.
"We multiply 4% by your daily free allowance, which is 20.5 gigabytes each day. This equals 0.82 gigabytes. We then subtract your free allowance from your peak usage to determine your billable peak." [ed: that 0.82 GB is the hourly free allowance]
Free allowance hourly rate: 4.1666% · 0.166 = 0.00694, or (X/600)/24 = X/14400. (Amazon's example rounds 1/24 down to 4%, i.e. X/15000, which is how their 12TB case lands on (12 · 1024)/15000 = 0.8192GB free; that verifies.)
Billable peak hourly is then: hourly peak rate minus free rate, 4.1666 - 0.00694 = 4.1597, or (X/24) - (X/14400), i.e. (X - (X/600))/24, i.e. (599X/600)/24. So, for a correctly metered even retrieval, billable peak hourly will always be ~0.0415972222X. Always.
Let's check: 100GB · 0.0415972 = 4.15972, matching. We can't compare directly to Amazon, because their example calculates the hourly rate for a 24GB download out of a 12TB archive, but 1 - 0.8192 = 0.1808 still checks out against their 0.18GB billable peak. Pulling their entire set evenly over a day would be 0.0415972 · (12 · 1024) = 511.147GB/hr billable peak hourly by the exact formula, or (12 · 1024)/24 - 0.8192 = 511.1808GB/hr using their rounded free allowance (nice pipes, kids).
The retrieval fee is then 0.0415972X · 720 · the tier price; I really do not understand the origin of the tier pricing, but all examples seem to use $0.01. So, $29.95 per 100GB. For 12TB, say hello to a $3,680.26 transfer fee. 3TB is $920.06.
In general: 720 · ((599X/600)/24) · 0.01. So for transferring your entire set of X GB of data, spread evenly across one day, you will be charged (599X/600) · (3/10) dollars, i.e. just under $0.30 per GB stored.
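A tiny sketch of that formula, just to check the arithmetic (same assumptions: the whole set retrieved evenly over one 24-hour day, $0.01 tier price):

    # Billable peak hourly = (599X/600)/24; fee = that * 720 * $0.01 = (599X/600) * 0.3
    def one_day_even_fee(x_gb, tier_price=0.01):
        billable_peak_hourly = (x_gb - x_gb / 600) / 24
        return billable_peak_hourly * 720 * tier_price

    print(one_day_even_fee(100))        # ~$29.95
    print(one_day_even_fee(3 * 1024))   # ~$920.06
    print(one_day_even_fee(12 * 1024))  # ~$3680.26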
That's certainly interesting. As there will be migration from S3 to Glacier, it would be nice if Tarsnap had an option to store only (say) the last week in S3 (at $0.30/GB/month) and the rest in Glacier (at, say, $0.03/GB/month).
That would certainly be very nice. cperciva, what do you think?
I can't see any way for Tarsnap to use this right now. When you create a new archive, you're only uploading new blocks of data; the server has no way of knowing which old blocks of data are being re-used. As a result, storing any significant portion of a user's data in Amazon Glacier would mean that all archive extracts would need to go out to Glacier for data...
Also, with Tarsnap's average block size (~ 64 kB uncompressed, typically ~ 32 kB compressed) the 50 microdollar cost per Glacier RETRIEVAL request means that I'd need to bump the pricing for tarsnap downloads up to about $1.75 / GB just to cover the AWS costs.
I may find a use for Glacier at some point, but it's not something Tarsnap is going to be using in the near future.
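To put that per-request figure in context, here's the arithmetic (mine, using the ~32 kB compressed block size mentioned above):

    # Glacier RETRIEVAL request cost alone for 1 GiB of ~32 kB blocks,
    # at 50 microdollars ($0.05 per 1,000 requests) each.
    blocks_per_gib = (1024 * 1024) / 32       # ~32768 blocks
    print(blocks_per_gib * 0.00005)           # ~$1.64/GiB before bandwidth or peak-rate fees,
                                              # consistent with the ~$1.75/GB figure above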
While I have no idea how you would fit it into your current infrastructure, I certainly see a (BIG) use case for: I have this 100 GB, store it somewhere safe (in Glacier), I won't need it for the next year (unless my house burns down).
I agree that is a bit different from ongoing daily backups with changes, but it's also not THAT different from a customer's perspective.
That it doesn't fit with how you store blocks on the backend won't matter to a lot of customers.
I understand. I had hoped Tarsnap knew which blocks do (or do not) get reused.
It's unfortunate, because some backups just lie around for a very long time. It would be nice to take advantage of (the low cost of) Glacier for that.
That said, if it's not possible with Tarsnap now, it's not possible now. :D If you find a satisfying way to incorporate it into the new backend(s) design (if that's fixable in the backend(s) alone), I'd surely be pleased.
Here's to hoping that duplicity and git-annex could somehow make use of this service. I'm far more optimistic about duplicity support though, as incremental archives seem to fit the glacier storage model much better. A git-annex special remote [1] might turn out to be much more challenging, if at all possible.
From the retrieval times they are giving, it seems plausible that they could be booting the servers only 5 or 6 times a day, to run the upload and retrieval jobs stored in a buffer system of sorts. Having the servers turned off for the majority of the time would save an immense amount of power, although I wonder about the wear of drives spinning up and down compared to being always on.
Any other theories on how this works on the backend while still being profitable?
This sounds really appealing as a NAS backup solution, but I'm a bit concerned about security and privacy. Let's say I want to back up and upload my CDs and movies - would Amazon be monitoring what I upload and assume I'm doing something illegal?
In my experience Amazon doesn't care what you use AWS for as long as 1) your checks clear and 2) they don't receive any LEO interest in your data/service usage (i.e. WikiLeaks)
One recurrent issue with Amazon services is that they charge in US$ and currently do not accept euros. European banks charge an arm and a leg for micropayment conversions: last time I got a bill from AWS for 0.02$ it ended up costing me 20 euros or so. Pretty much kills the deal.
One solution could be to pre-pay 100$ on an account and let them debit as needed.
That's really odd; I've never heard of fees that high - in the UK I can use my credit cards with Amazon and only pay a few pennies in transaction fees. Have a look at the CaxtonFX Dollar Traveller: you need to load it with $200, but then there are no transaction fees, I think, except a slightly worse exchange rate.
Sure there are workarounds. But why should I have to jump through hoops when my account at Amazon is already entrusted with a European credit card and merrily charging euros for all other goods?
This sounds like it's an issue related with your bank trying to scrounge more money off you. I have a European debit card and I get charged the equivalent in EUR, nothing more.
You'll have to ask your bank since they're the ones screwing you. I've been using my US Wells Fargo account in Europe, and my British Lloyds TSB account in Europe and the US, and I've never gotten screwed anywhere near that bad. Maybe 1-5% premium I had to pay in some cases.
But in any case, you should look into getting a USD account. I have a USD and EUR account from Lloyds TSB International and it is a good way to guarantee I never get screwed even a little bit when I'm traveling and doing a lot of small transactions.
Amazon S3 is designed to provide 99.999999999% durability of objects over a given year. This durability level corresponds to an average annual expected loss of 0.000000001% of objects. For example, if you store 10,000 objects with Amazon S3, you can on average expect to incur a loss of a single object once every 10,000,000 years....
Which of course means that (if they're telling the truth) the probability of losing your data mostly comes from really big events: collapse of civilization, global thermonuclear war, Amazon being bought by some entity that just wants to melt its servers down for scrap, etc. (Whose probability is clearly a lot more than 10^-11 per year; the big bang was only on the order of 10^10 years ago.)
There's some clever wordplay/marketing here... "designed to provide 99.99..99%" means that the theoretical model of the system tells you that you lose 1 in X files per year when everything is working as modeled (e.g. "disks fail at expected rate as independent random variables"). If something not in the model goes wrong (e.g. power goes out, a bug in S3 code), data can be lost above and beyond this "designed" percentage. The actual probability of data loss is therefore much, much higher than this theoretical percentage.
A more comical way to look at it: the percentage is actually AWS saying "to keep costs low, we plan to lose this many files per year; when we screw up and things don't go quite to plan, we lose a _lot_ more."
per object. So although the chance of losing any particular object is tiny, the chance of you losing something is proportional† to the number of objects. Still extremely small.
Yes. Though I bet the real lossage probabilities are dominated by failure events that take out a substantial fraction of all the objects there are, and that happen a lot more often than once per 10^11 years.
Agreed. More likely a catastrophic and significant loss for a small number of customers rather than a fraction of a percentage of loss for a large number.
Similar deal for hard drive bit error rates, where the quoted average BER may not necessarily accurately represent what can happen in the real world. For example, an unrecoverable read error loses 4096 bits (512 byte sectors) or 32768 bits (4k sectors) all at once, rather than individual bits randomly flipped over a long period.
I'm currently using an app called Arq that backs everything up to S3. If I had to guess, I'd say there's about 50-60 gigs or more on there. Last month's bill was something like .60 cents. How does Glacier compare or contrast to this setup (the app does something similar with the archive concept)?
Would love to know where you're getting that $0.60 number (I assume you meant 60 cents and not 0.60 cents). Even with the first-year free tier, it costs $7.00/mo ($6.00/mo on RRS) to store 60 GB of data on S3.
I'd also like to know how the bill was so low (I'm the developer behind Arq). Is it perhaps because it's the first month's bill and you haven't had the 60GB on S3 for very long (not a full month)?
Crashplan is still cheaper for storage larger than 400GB.
Crashplan+ Unlimited is USD 2.92/month if you take the 4-year package, whereas if I upload 300GB to Amazon I pay 0.01 * 300 = USD 3/month. Amazon would be even more expensive for larger amounts of data.
Is there some fine print I'm missing with Crashplan unlimited?
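No fine print needed to see where the lines cross; on storage alone the break-even is simple (my arithmetic, ignoring Glacier's request and retrieval fees, which only push the break-even higher):

    # Break-even between CrashPlan+ Unlimited ($2.92/month) and Glacier ($0.01/GB/month),
    # storage cost only.
    print(2.92 / 0.01)   # ~292 GB: above this, the flat-rate plan wins on storage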
Whenever I read about "Unlimited" plans (Backblaze has it as well, for $3.96 if you get a 1 year plan) I always think of things like Joyent and their "Lifetime" hosting, or AT&T and their "Unlimited" data plans. Their business plan is usually structured around people not actually using the service, and those who do use the "Unlimited" option usually end up either (A) being rate limited, or (B) having a conversation with the hosting/data provider to encourage them to transition elsewhere.
What's exciting about this is that Amazon doesn't care _how_ much data you send them - presumably they've priced this so it's profitable at any level you wish to use. It's a sustainable model. Services like Tarsnap/Arq will likely adopt this new service (possibly offering tiered backup/archival services?).
I have (close to) zero doubt that Amazon's Glacier archival storage will be available 5 years from now at (probably less than) $0.01/Gigabyte/Month. They are a (reasonably) safe archival choice. Now that light users (<300 Gigabytes) have a financial incentive to move off of CrashPlan onto Amazon, it further exacerbates the challenges that "Unlimited" backup providers will face. All their least costly / most profitable customers may leave (or, at the very least, the new ones may choose Amazon first).
With that said - I love Backblaze (Been a user since 2008) for working data backups, rapid-online (free) restores - and I will continue to use them, but I wouldn't plan on archiving a Terabyte of Data to them for the next 20 years.
They put the provider in an adversarial relationship with the user and give them an incentive to keep you from storing data there. They will make it hard for you to use their service.
Also remember it's free to recover your entire Crashplan archive (I have 500GB with them). If you wanted to recover 500GB from Glacier it would cost about $200 @ 10MB/s (according to someone's calculation further down). You have to pay for retrieval.
Not totally true - you pay for retrieval only from 1GB and upwards per month: http://aws.amazon.com/glacier/#pricing - as well as $0.05 per 1,000 upload/retrieval requests.
I just started a project where I'm keeping a raspberry pi in my backpack and am archiving a constant stream of jpgs to the cloud. I've been looking at all the available cloud archival options over the last few days and have been horrified at the pricing models. This is a blessing!
I am a bit confused with the pricing of retrieval.
Could some good soul tell me how much would cost to:
Store 150 GB as one big file for 5 years. To this I will add 10 GB (also as one file) every year. And let's say I will need to retrieve the whole archive (original file plus additions) at the end of years 2 and 5.
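I can't answer the retrieval part precisely (the billing rules are exactly what's being debated above), but the storage side is easy to sketch (my arithmetic, $0.01/GB/month, ignoring request and transfer-out charges):

    # Storage only: start with 150 GB, add 10 GB at the start of each later year.
    def storage_cost(years, base_gb=150, added_per_year=10, rate=0.01):
        return sum((base_gb + added_per_year * y) * rate * 12 for y in range(years))

    print(storage_cost(2))   # ~$37.20 for the first two years
    print(storage_cost(5))   # ~$102.00 for five years

Each full restore, if spread evenly over a day per the formulas discussed above, would add very roughly $50-60 plus transfer-out charges, but that depends on how Amazon actually meters the peak hour.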
They should provide a "time capsule" option - pay X dollars, and after a set number of years, your data archive will be opened to the public for a given amount of time.
There'd be no better way to ensure that information would eventually be made public.
Kinda: "In the coming months, Amazon S3 will introduce an option that will allow customers to seamlessly move data between Amazon S3 and Amazon Glacier based on data lifecycle policies."
Looks like the perfect solution to backup all my photos. Considering 100 GB of photos, that's still just 100 * 0.01 * 12 = $12 a year! I'm sold on this!
This looks like exactly what I need. I'm currently using S3 for backup and archiving by regularly running s3cmd to sync my data on my NAS.
And while not super-duper expensive, s3 provides much more than I really need, and hence a more limited (but cheaper) service would definitely be appreciated.
If there is anything with the ease of use of s3cmd to accompany this service, I will be switching in a heartbeat.
I'm with Atlantic.net cloud [AWS competitor, full disclosure]; the price point for storage is great, but retrieval seems expensive -- perhaps if retrieval is very rare it's offset by the savings on storage. I know you can mail in drives for storage; can you have them mail you drives for retrieval? (for Glacier specifically)
Also, prior comments made mention they were using some sort of robotic tape devices, but according to this blog:
I don't think they're loss-leadering on storage, but if they are, they don't think they will be for long. AWS (EC2 and S3 in particular) does very well when it comes to profit margins. I suspect they'd like to keep it that way, and that whatever they're charging gives them some slice of profit, however small.
That'd make a ton of sense for forensic and compliance needs: lots of storage, limited access in special situations where a little delay is reasonable.
We just ran a quick cost forecast and it's interesting:
If you start with 100GB then add 10GB/month, it would cost $102.60 after 3 years on AWS Glacier vs $1,282.50 on AWS S3!
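That forecast is easy to reproduce (a sketch; it bills each month's balance after that month's 10 GB addition, at $0.01/GB for Glacier vs $0.125/GB for standard S3, and ignores request and transfer charges):

    # 3-year forecast: start at 100 GB, add 10 GB each month.
    def three_year_cost(rate_per_gb):
        return sum((100 + 10 * month) * rate_per_gb for month in range(1, 37))

    print(three_year_cost(0.01))    # ~$102.60 on Glacier
    print(three_year_cost(0.125))   # ~$1282.50 on S3 standard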
It always seems funny though that these companies say
"We will keep your data safe! *
* T&C's apply, if we lose it, you're on your own."
I shouldn't imagine bank vaults deal with physical property the same way. Can you get insurance for digital assets the same way you can for physical ones?
Can anyone make sense of the retrieval fees? Seems like the most confusing thing ever. If I'm storing 4TB and one day I want to restore all 4TB, how much is it going to cost me?
Admittedly ~$737 isn't the end of the world if your house has burned down and you need all your data back, but it's still important to know the details.
I think in that situation, it would be cheaper to use their bulk import/export, which would be roughly $300 for 4TB
This is so confusing. So apparently if you spend the entire month retrieving the data at 1.6MB/s it only costs $40 plus transfer fees? And more importantly, how do you throttle your retrieval?
Edit: So I'm working through a scenario in my head and trying to figure out how charging based on the peak hour isn't completely ridiculous.
I have 8GB stored to try out the system. This costs a whopping dollar per year. One day I decide to test out the restore feature. So I go tell Amazon to get my files and wait a few hours. When Amazon is ready, I hit download. I'm on a relatively fast cable connection so the download finishes in an hour. I look at the data transfer prices and expect to be charged one dollar.
But I didn't take into account this 'peak hour' method. I just used roughly 8GB/hour over the minimal free retrieval. This gets multiplied out times 24 hours and 30 days to cost 8 * 720 * $0.01 = $57. Fifty-seven times my annual budget because I downloaded my data too quickly after waiting hours for Amazon to get ready.
If you're in a corporate environment, you'd likely have a network admin group that could throttle your connection to them. Or you'd be using a custom-designed front-end that would throttle it. Or if you're SOHO, you could set up QoS rules on your router to throttle it.
Realistically though, this service might not be for you if fast and cheap retrieval of your data is important. The emphasis here is cheap storage, not transfer. They could reasonably expect that you'd only retrieve this data once or twice, if ever, and cost won't be a deterrent - say, if your data center burns down and your company is moving to a new office.
Oh, I totally understand that people are going to be willing to pay more after a disaster to get files back. But realistically, if there is a large enough volume of files to be worth sending to Amazon, then it's going to take day(s) to download. If they rate-limited by day, then the price would only reach a couple years' worth of storage. The hourly thing is only going to bite minor retrieval events, and it is going to bite them amazingly hard.
My department (information security) was actually just discussing this service this morning at our morning meeting. We've been looking into backup services for our security monitoring appliance beyond our datacenter and DR site. These backups would need to persist for a year according to PCI/SOX compliance, and if we needed to show data to an auditor, we wouldn't need the entire log. In fact, we'd likely not even need 5% of it. Most likely, we'd only need to pull a day (maybe two) from the logs.
I can imagine our media services group talking about the same thing, how to keep master files of their product pictures/videos where they'd only need to grab a file or two here or there (if at all).
Amazon seems to be pushing this as a file dump where retrieval of the files is exceedingly rare. They don't want you to use it as a hard drive, they want you to use it as a magnetic tape drive.
You're right it is complicated. But it also depends on how much you have archived because you get to access 5% of your data on a prorated basis too and that's based on your peak hourly rate.
But ultimately, this product isn't designed for backup purposes. It's designed for archive purposes. If you have 4TB of customer data from 3+ years ago that you never access, but need to keep in case the IRS does an audit, then this is the place to put it.
Amazon should complement this service with data contact centers connected to their data center network. Then people could go to these centers in person and hand over their hard drives full of data for backup. It would be like bank lockers, but digital. At this low price people will want to upload terabytes of data, which will be a pain to upload/download.
This looks awesome! We are currently developing a P2P-based backup solution (http://degoo.com) where we are using S3 as a fall-back. This will allow us to be much cheaper, and I am sure it will enable many other backup providers to lower their prices too.
I hope they can get the access time down from 3-5 hours to about 1 hour - that's the difference for me between it being a viable alternative for storing backups of my client's web sites or not.
I might create a script that uploads everything to Glacier and just keeps a couple of the latest backups on S3 though.
Per Werner's blog post [1] "in the coming months, Amazon S3 will introduce an option that will allow customers to seamlessly move data between Amazon S3 and Amazon Glacier based on data lifecycle policies."
Does anyone know if there is a CLI tool to interface with this yet? I see SDKs mentioned on the product homepage, but I don't see any simple CLI tools for this yet to upload/download data, query, etc.
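I haven't seen a polished CLI yet either. As a rough sketch of what the programmatic side looks like, here's the shape of it with a current Python SDK (boto3; the vault and file names are made up, and this is my illustration rather than anything Amazon or this thread provides):

    # Hedged sketch of programmatic Glacier access with boto3.
    import boto3

    glacier = boto3.client("glacier", region_name="us-east-1")

    # accountId="-" means "the account that owns these credentials".
    glacier.create_vault(accountId="-", vaultName="my-backups")

    with open("photos-2012.tar.gz", "rb") as f:
        resp = glacier.upload_archive(
            accountId="-",
            vaultName="my-backups",
            archiveDescription="photo archive",
            body=f,
        )
    archive_id = resp["archiveId"]   # keep this; it's the only handle for retrieval/deletion

    # Retrieval is asynchronous: initiate a job, wait hours for it to complete,
    # then pull the bytes with get_job_output.
    glacier.initiate_job(
        accountId="-",
        vaultName="my-backups",
        jobParameters={"Type": "archive-retrieval", "ArchiveId": archive_id},
    )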
I'm curious about data security. It says it is encrypted with AES. But is it encrypted locally and the encrypted files are transferred? I.e. does Amazon ever see the encryption keys?
Or is the only way to encrypt it yourself, and then transfer it?
AFAIK the keys are managed by Amazon, just like S3. It's more for compliance reasons rather than real security. Encryption still has to be done yourself to protect the data.
It will be interesting to know if they will upgrade AWS Storage Gateway to use this kind of backend instead of S3
http://aws.amazon.com/storagegateway/
It's a bit strange how S3 has a 'US Standard' region option, while Glacier has the usual set of regions (US East, US West, etc). I wonder if this means that unlike S3, Glacier isn't replicated across regions?
Keep in mind that even if your data is in one AWS region, it'll still be stored in multiple different datacenters some distance apart. Just not on the other side of the US.
I just saw this and thought wow finally cheap mass storage. After reading the comments (and the Amazon Glacier web page) it's clear it's cheap archiving but not cheap retrieval.
Any desktop apps out there that will let you add folders for backup to Glacier and have them be automatically synced up to the cloud as they change? That would be quite useful.
Excellent! Is there any good open source client / backup application for this? I would start using it immediately. I'm currently using a ridiculously expensive backup solution.
I wonder if Glacier support in Duplicity will be possible without large changes. AFAIK, duplicity also reads some state from the remote end to determine what to back up (although it also keeps a local cache of this?). To use Glacier, the protocol would have to be completely write-only.
I'd guess it would use a hybrid approach, with recent backups on S3 (which duplicity already does) being shifted to glacier after a period of time. The FAQ indicates that Amazon plans to make this easy.
In case anyone is wondering, it appears you can only upload via their APIs right now. I wonder if they intend to make it accessible through their web interface at some point?