Ice cold archive storage (cloud.google.com)
140 points by tpetry on Apr 11, 2019 | 79 comments



The interesting part is that it's cheaper than AWS Glacier ($4 per TB per month) and slightly more expensive than AWS Glacier Deep Archive ($0.99 per TB per month), but the data is available immediately and not in hours like Glacier, where you have to pay a hefty premium for faster access to the data.


Interesting, unlike Glacier this is significantly cheaper than Backblaze B2, meaning I might have to reconsider how I do my backups again. Any good backup tools supporting this type of service?

I rely on Restic at the moment, which seems to need fast read access to data, but its incremental snapshotting is great. It'd be ideal if I could find something like that supporting these "cold storage" solutions.


One thing I do consider a real value-add for AWS Glacier, though, is their native support for offline media import/export. I.e., you can just send them a hard drive of your own for data load, and pay to get a hard drive back out as well. As gigabit (or faster) class WAN slowly spreads this will someday become unnecessary, but right now, in many, many places, a company could easily have terabytes to back up but 10/1 ADSL as its best available connection. Even with faster connections, aggressive data caps are sadly not infrequent. Whether it's for the initial load, ongoing use or faster recovery, sometimes there is still nothing like a multi-TB drive or two in the mail.

There are 3rd parties that will do it for you (Iron Mountain is at least one) but that's an extra cost and Google takes no responsibility for it. I assume this is an example of a place where Amazon is able to leverage its holistic business, with a cloud service that can also take advantage of its physical logistics system. Google's service here is significantly cheaper and has some nice features, but even if Amazon isn't worth a $4 vs $1.23 premium, I could definitely see continuing to pay them some premium (say $2 vs $1.23) for that alone anywhere with limited high-speed WAN availability.


Disclosure: I work on Google Cloud.

We also have a Transfer Appliance [1], that comes in two sizes (100T and just under 500T). We don’t currently support shipping one filled up with your data for recovery/export though.

[1] https://cloud.google.com/transfer-appliance/


Backblaze also offers that option. You can mail them up to 8 TB on an external HD and have it loaded into their system for $190, or up to 256 GB on a USB stick for $100. [1]

You can also request a "B2 Fireball" [2] from them. It's basically a small array that they mail to you for $550 with 70 TB of storage. You fill it up and send it back to them within the month, and they'll load the data into your account.

[1] https://www.backblaze.com/b2/cloud-storage-pricing.html (Bottom of the page)

[2] https://www.backblaze.com/b2/solutions/datatransfer/fireball...


For comparison, Amazon supports up to 16 TB in their basic service, with an $80 flat handling fee per storage device and then $2.50 per data-loading hour. Since they support 2.5"/3.5" SATA and external eSATA & USB 2/3.0 interfaces and it's a pure sequential transfer, it's not much trouble to get at least close to maximum sequential speed, which even for decent spinning rust should allow a good half TB an hour at least. I've never tried an SSD so I'm not sure if they can saturate 6 Gbps, but as even a 32 hour transfer of 16 TB would only be another +$80 it may not be generally relevant anyway.
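
A quick back-of-the-envelope version of that estimate, using the $80 handling fee and $2.50/hour rate above; the 16 TB size is their stated maximum and the 0.5 TB/hour sustained rate is my assumption for decent spinning rust:

    # Rough cost estimate for mailing Amazon a drive (a sketch, not official pricing math).
    DRIVE_TB = 16                 # their stated maximum for the basic service
    THROUGHPUT_TB_PER_HOUR = 0.5  # assumed sustained sequential rate for a spinning disk
    HANDLING_FEE = 80.00          # flat per-device fee (USD)
    HOURLY_RATE = 2.50            # per data-loading hour (USD)

    hours = DRIVE_TB / THROUGHPUT_TB_PER_HOUR   # 32 hours
    total = HANDLING_FEE + HOURLY_RATE * hours  # $80 + $80 = $160
    print(f"~{hours:.0f} h of loading, ~${total:.0f} total (~${total / DRIVE_TB:.2f}/TB)")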

Amazon's equivalent to B2 Fireball is "AWS Snowball" (amusingly enough, not sure if there is a bit of fun name riffing between the two here), which is a service fee of $200/50TB and $250/80TB device, any onsite days after the first 10 at $15/day.

It's interesting how the pricing mix is on this feature though. Amazon offers lower potential ingress pricing depending on your use, though notably if you kept the Snowball a whole month the pricing would get very close to the Fireball (+20 days @$15/day brings the price to $500/$550 respectively, though the former with 20TB less and the latter with 10TB more).

Backblaze and Google are both much cheaper to get data out of, though; Amazon's Glacier and descendant services remain very much deep-freeze focused.


What are the shipping costs on a Snowball or other appliances?


AWS's new Glacier Deep is actually cheaper than Google's Ice Cold, $1/TB/month.

https://aws.amazon.com/about-aws/whats-new/2019/03/S3-glacie...


Those retrieval costs, though...


Anyone know the retrieval cost for Ice Cold? I don't see it mentioned in the post.


If it's the same as their other storage, which isn't really clear... About $50 /TB


On B2 that's $10... yeah, might be reconsidering this...


I expect that Google will also charge for retrieval. Their egress is really really expensive.

There may also be a minimum storage period, like Amazon has.

Let's wait and see.


It looks like a lower tier than the existing Coldline and Nearline (7x cheaper for storage than the former). Both have a minimum period, so this one is likely to have one as well. Coldline and Nearline are more expensive than regular storage when fetching objects, which means ice cold storage is probably even more expensive when you restore (is it going to be 7x, too, keeping symmetry?).


Is their egress more expensive than Amazon's? Because when I had a look at that, it sure wasn't cheap either.


I've never tried it, but I know https://www.arqbackup.com/ supports Google Cloud.


The concept, idea, and flexibility of Arq is great, ideal even. The amount of control is nice. I wish it were open source.

The actual product is pretty painful when you need to do a recovery, especially if you don't know where the file lived on disk. I haven't tried the newer Arq Cloud Backup destination to see if it improves the search experience.

That said, my experience is from more than a year ago and I would try it again if they were able to bring their search on par with current consumer backup offerings.


The place where this won't be as cheap as Backblaze is retrieval. Unless Google makes a big change, you'll still have to pay for network egress, which is obscenely priced: https://cloud.google.com/storage/pricing#network-egress


Borg Backup is mostly the same as Restic (regarding dedup / incremental backup) [1] and aggregates data into large chunks.

If you only back up from a single machine, it keeps a local cache of already backed-up data. This has the big advantage that it basically only needs to push the delta to the remote, rather than doing any kind of synchronization to check what is already there (a toy sketch of the idea follows below).

[1]: https://stickleback.dk/borg-or-restic/
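
Not Borg's actual format, but a toy sketch of that local-cache idea: chunk the data, hash each chunk, and push only chunks whose hashes aren't already recorded locally. Chunk size, file names and the upload stub are all made up (and Borg actually uses content-defined chunking rather than fixed-size chunks):

    import hashlib, json, os

    CHUNK_SIZE = 4 * 1024 * 1024    # toy fixed-size chunks
    CACHE_FILE = "chunk-cache.json" # local record of chunk hashes already uploaded

    def load_cache():
        if not os.path.exists(CACHE_FILE):
            return set()
        with open(CACHE_FILE) as f:
            return set(json.load(f))

    def upload(chunk_id, data):
        print(f"uploading {chunk_id[:12]}... ({len(data)} bytes)")  # stand-in for the real push

    def backup(path):
        cache = load_cache()
        with open(path, "rb") as f:
            while chunk := f.read(CHUNK_SIZE):
                chunk_id = hashlib.sha256(chunk).hexdigest()
                if chunk_id not in cache:       # only the delta is pushed to the remote
                    upload(chunk_id, chunk)
                    cache.add(chunk_id)
        with open(CACHE_FILE, "w") as f:
            json.dump(sorted(cache), f)

    backup("some-large-file.img")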


"Borg Backup is mostly the same as Restic (regarding dedup / incremental backup)"

... with one very big difference - you can only point borg at an SSH host. You can't point borg at S3 or B2 or Glacier, etc.

rsync.net supports both borg and restic, but even the heavily discounted plans[1] are much more expensive than "Cold Storage" or Glacier, because they are live, random access UNIX filesystems ...

[1] https://www.rsync.net/products/borg.html


Shameless plug: I built a backup service[1] just for Borg and the price on the large plan is $5/TB. Not as cheap as "cold storage", but still better than rsync.net and the same as B2.

Also worth pointing out that my storage is calculated after compression and deduplication. So depending on the data a Borg backup can be much smaller than the actual data.

1: https://www.borgbase.com


Interesting. I've been backing up to a storage node at time4vps. I have an older plan at about $15/quarter. https://billing.time4vps.eu/?affid=1881


True - which is kind of weird, because as far as I understand their respective "databases", Borg would be better suited for arbitrary remote storage: it should basically only need an "upload file" command, without any interactivity, except for its robustness checks and some additional flexibility (having multiple backup sources, deleting data that is no longer needed).

Restic seems designed from the ground up to utilize the existing power of a filesystem as a database, so it needs remote storage that offers quick interactivity (esp. checking for existing files), i.e. it's impossible to use something like Glacier as a backend.

It's not a problem for me since I just back up to a local drive and (am planning to set up) synchronization to a remote dumb storage.


I need more information than this post provides before switching archival solutions.

Actually, since it’s google I likely wouldn’t consider them regardless.


I've been using Duplicati for some time. It works OK, not perfect. I especially wish I could send backups to multiple locations (e.g. local/B2).


What tool do you use to do your backups? rclone?


Rclone only copies stuff. It doesn't compress, deduplicate or version. Some backends do versioning though.


> Unlike tape and other glacially slow equivalents

Shots fired. I like it when multinational corporations with revenues the size of midsized countries engage in some childish puns.


The title seems like a reference to the 2003 Outkast song Hey Ya! https://www.youtube.com/watch?v=PWgvGjAhvIw

Alright alright alright!


Contrast: Petabox, from the Internet Archive.

https://archive.org/web/petabox.php

Density: 1.4 PetaBytes / rack

Power consumption: 3 KW / PetaByte

No Air Conditioning, instead use excess heat to help heat the building.

Raw Numbers as of August 2014:

4 data centers, 550 nodes, 20,000 spinning disks

Wayback Machine: 9.6 PetaBytes

Books/Music/Video Collections: 9.8 PetaBytes

Unique data: 18.5 PetaBytes

Total used storage: 50 PetaBytes

Costs are $2/GB, lifetime, I believe.

https://help.archive.org/hc/en-us/articles/360014755952-Arch...


Does anybody know what the retrieval fees will likely look like? I've been wary of most of the "cloud archival" solutions because while they're cheap to put data into, they seem to charge you a billion dollars to actually retrieve it.


FWIW, this is still an ideal model for backup storage: If your more regular backup model is robust and your network is well-secured, you'll never need retrieval. And if you need it, you need it, and it's justifiable to spend big to save your business.


Backup plans don't mean much unless you fully test a restoration process periodically though.


I'd be confident just periodically testing small random parts.

For me this is a "last resort backup": it costs little to keep around, and god forbid we ever need it. BUT that means we need to account for the case where we do need it! And if it's going to cost too much then there's no point in the backup anyway.


I would generally agree. First of all, you're going to test a lot of your restore processes with backups which are closer to home: You should make sure your VMs can all restore from your onsite (or just less icy) backups, for instance. As long as you're confident in that, the only thing you need to test with "ice cold" storage is that you can successfully restore a single VM from it, since you know all of your VMs can be restored.


I'm mostly just evaluating it for personal/hobby backups (few terabytes), I know business will for sure look at it differently.


Same here. As a company you can go "we need this to save our asses, I don't care if it costs $50k in a 4 person company", but personally I kind of do care about the cost for retrieval...

I've been comparing cloud storage prices to hard drive prices for years now. My first thought when seeing the storage prices was "huh, that might actually be worth it", but depending on the retrieval costs, you might still want to roll your own no matter the storage costs. For private use, I am (was?) planning a variant of this as soon as I am finished doing a server migration: https://old.reddit.com/r/DataHoarder/comments/7rjcdn/home_ma...


Also compare with FTP and storage VPS offerings. The cheapest I've seen is $1 per TB/month.


A cheap storage VPS won't give you 11 9s of durability, though. Chances are a single drive failure will cause data loss.


It's your backup, not your primary system. The odds that more than one drive fails within the same, say, week, are probably perfectly acceptable for most people.


You can hash the data at regular intervals to make sure it's intact. For example, by adding this command to crontab:

    /usr/bin/md5sum --quiet -c md5sum.chk
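
(The check file has to be generated up front, e.g. md5sum * > md5sum.chk.) If you'd rather have something scriptable that both builds and verifies the manifest, here's a rough Python equivalent of the same idea; the manifest name and CLI are made up:

    import hashlib, json, pathlib, sys

    MANIFEST = "manifest.json"   # assumed manifest file name

    def sha256(path):
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for block in iter(lambda: f.read(1 << 20), b""):
                h.update(block)
        return h.hexdigest()

    def build(root):
        files = {str(p): sha256(p)
                 for p in pathlib.Path(root).rglob("*")
                 if p.is_file() and p.name != MANIFEST}
        with open(MANIFEST, "w") as f:
            json.dump(files, f, indent=2)

    def verify():
        with open(MANIFEST) as f:
            expected = json.load(f)
        bad = [p for p, digest in expected.items()
               if not pathlib.Path(p).is_file() or sha256(p) != digest]
        if bad:
            sys.exit("corrupted or missing: " + ", ".join(bad))
        print("all files intact")

    # usage: python check.py build /backups   (then, periodically)   python check.py verify
    if sys.argv[1] == "build":
        build(sys.argv[2])
    else:
        verify()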


What is the meaning of the claim about "99.999999999% annual durability"? Does that mean one chance in 100B of an object being unretrievable?


Backblaze wrote a good piece on that https://www.backblaze.com/blog/cloud-storage-durability/


Thank you for that link! It's very interesting to me that all the calculations laid out here assume independent failure events.
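
For a sense of why that assumption matters so much: with independent copies, the loss probabilities multiply, so the "nines" add up fast. A toy calculation (the per-copy loss probability is an arbitrary assumption, not any provider's real number):

    import math

    p = 1e-4   # assumed annual probability of irrecoverably losing one copy of an object
    for copies in (1, 2, 3):
        loss = p ** copies   # only valid if the copies fail independently
        print(f"{copies} independent copies: annual loss ~{loss:.0e} (~{-math.log10(loss):.0f} nines)")
    # With correlated failures (same rack, same software bug, same datacenter),
    # the real loss probability sits much closer to p than to p**copies.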


It is universal practice within cloud service providers to span redundancies across "fault domains" - basically things which could make failures correlated, like being in the same machine/power strip/datacenter/geo region. If you assume your fault domain analysis is good, then your failures should be independent. Many global outages are the result of a previously-unidentified fault domain, like the Azure certificate issues. Of course past a certain point it becomes unimportant - who cares about your data if an asteroid takes out every datacenter on Earth at once?


Nothing in engineering is 100%. As far as engineering goes 11 9s is pretty much as good as you can get. For comparison, AWS S3 and Glacier are 11 9s durability too.


It's worth bearing in mind the difference between durability and availability. Durability is roughly the chance of losing your data over a given time span (in this case a year), whereas availability is about how reliably you can access the data (and is almost certainly a lot lower than 11 9s). A service can be very durable but have very poor availability.


"What is the meaning of the claim about "99.999999999% annual durability"?"

It has no meaning whatsoever. Someone on the marketing side of the team decided that was a "competitive" number to present outwards, and someone in engineering was tasked with working backward from that number to come up with some plausible calculation that resulted in it.

In the real world, they, like Azure and Amazon, will have single point in time outages that will wipe that out for a year or more.

Here is what an honest assessment looks like:[1]

"Historically (updated April, 2019) we have maintained 99.95% or better availability. It is typical for our storage arrays to have 100+ day uptimes and entire years have passed without interruption to particular offsite storage arrays."

...

"In the event of a conflict between data integrity and uptime, rsync.net will ALWAYS choose data integrity."

[1] https://www.rsync.net/resources/notices/sla.html


> In the real world, they, like Azure and Amazon, will have single point in time outages that will wipe that out for a year or more.

An outage affects availability, but as long as it's not permanent it doesn't affect durability. For example, if I add a new backup provider that stores data on-premise I've added a (nearly) independent data store. This substantially decreases my risk of losing my data unrecoverably (increases durability) but if I don't set up any sort of automatic failover I'm still at risk for substantial outages (no practical increase in availability).

For example, I don't believe Amazon has ever lost any S3 data (https://www.quora.com/Has-Amazon-S3-ever-lost-data-permanent...), and if they did it would be a big deal. Same with the other major cloud storage providers.

> Someone on the marketing side of the team decided that was a "competitive" number to present, outwards, and someone in engineering was tasked with, working backward from that number, coming up with some plausible calculation that resulted in it.

I would be incredibly surprised if that happened. That's not the way I've seen anyone work here.

(Disclosure: I work at Google, though not in Cloud)


You are mixing availability (access at any given moment) with durability (not losing data). From the FAQ:

Cloud Storage is designed for 99.999999999% (11 9's) annual durability, which is appropriate for even primary storage and business-critical applications. This high durability level is achieved through erasure coding that stores data pieces redundantly across multiple devices located in multiple availability zones.

Disclaimer: I work at GCP, although not in GCS specifically.


"You are mixing availability (access at any given moment) with durability (not losing data)."

You are correct - I misread that as availability even after quoting that very same line.


Anyone care to speculate on the technology that allows them to offer the fast retrieval times and low cost per GB?


"Fast" is relative here. It's fast compared to Glacier and others, but it's going to be slower than the more expensive tiers.

You might not need to speculate much about how it works; it's probably implemented as described by Google themselves in slides 22ff. here:

http://www.pdsw.org/pdsw-discs17/slides/PDSW-DISCS-Google-Ke...

(In a nutshell, pack some hot data with a lot of cold data on many large drives, then put a Flash-based cache in front of them to get long tail performance predictability back.)
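
Here's a toy version of that read path, purely to illustrate the idea from the slides (not Google's implementation): a small fast cache in front of a big, slow, densely packed store, so repeated reads rarely touch the cold disks. All names and sizes are made up:

    import time
    from collections import OrderedDict

    class ColdStore:
        """Stand-in for big, slow, densely packed drives."""
        def __init__(self, objects):
            self.objects = objects
        def read(self, key):
            time.sleep(0.05)            # pretend seek/queueing latency on a busy spindle
            return self.objects[key]

    class CachedStore:
        """Small LRU 'flash' cache in front of the cold store."""
        def __init__(self, cold, capacity=1000):
            self.cold, self.capacity = cold, capacity
            self.cache = OrderedDict()
        def read(self, key):
            if key in self.cache:            # hot object: served from the cache
                self.cache.move_to_end(key)
                return self.cache[key]
            value = self.cold.read(key)      # cold object: pay the slow-disk latency once
            self.cache[key] = value
            if len(self.cache) > self.capacity:
                self.cache.popitem(last=False)   # evict the least recently used object
            return value

    store = CachedStore(ColdStore({"obj-1": b"...", "obj-2": b"..."}))
    store.read("obj-1"); store.read("obj-1")     # the second read never touches the cold store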


Thanks for the link! Interesting stuff.


There was also a talk about the low level storage service and the performance isolation work that allows it to mix batch and latency-sensitive traffic on the same drive, but it doesn't seem to have been recorded: http://www.pdl.cmu.edu/SDI/2012/101112.html

Gory details are in the patents, 9781054, 9262093 and 8612990, which I'm not linking directly, because your lawyers might not approve. There's even a follow up, 10257111. It's so new, from two days ago, that Google Patents can't find it, while Justia can.


You have to keep the data in that class for a certain period of time. That's the drawback. You can access the data at that price as long as you keep it there for a long time.

So I suspect this is not truly cold storage. That's why they can retrieve the data faster. It seems more like an economics hack (a longer commitment to keep the data allows them to buy and operate the storage hardware/software at a cost that can be amortized against those commitments).


It's a pretty good price but assuming you are storing 8TB and you get your own drive, the drive would pay for itself in about 14 months... so you would basically get the next 4 years for free if you are willing to manage it...
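
Roughly how that payback period falls out, using the ~$1.23/TB/month figure mentioned elsewhere in the thread and a guessed ~$140 street price for an 8 TB drive (both numbers are assumptions, not published pricing):

    TB = 8
    PRICE_PER_TB_MONTH = 1.23   # assumed archive-class price (USD), as quoted in the thread
    DRIVE_PRICE = 140.00        # assumed street price of an 8 TB drive (USD)

    monthly = TB * PRICE_PER_TB_MONTH   # ~$9.84/month in the cloud
    payback = DRIVE_PRICE / monthly     # ~14 months until the drive breaks even
    print(f"cloud: ${monthly:.2f}/month; your own drive pays for itself in ~{payback:.0f} months")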


Will that storage have "11 9's annual durability" and be stored in multiple locations?

Let's say you only need to write to it once and have 2 secure locations available for free; that still means you need 2 drives, which would pay for themselves in 28 months then.

Sure it's "cheaper" but it's far from being as good and the price difference isn't that big.


Optimally running your own drives also assumes that you can fill the drives ... and the next byte doubles your cost.


Google offers some interesting services, but their API is always so awfully complicated and cumbersome that I've given up entirely trying to use anything.


If you can use cp to move files around, you can use gsutil to do the same for GCS.

https://cloud.google.com/storage/docs/gsutil


I use Glacier as a sort of backup of my backups .... and was thinking about Glacier Deep, but this is tempting too.


In case you're still thinking about options, I would be fine to host a variant of this for you: https://old.reddit.com/r/DataHoarder/comments/7rjcdn/home_ma...


While one major use of something like this would be backups, how does one handle these backup sets with respect to GDPR requests? The window to respond is 30 days, so keeping backups longer than say 25 days seems cumbersome. You would need hot access to the sets to load them up and delete the data.


Encrypt backup data with a per-user key, keep the keys only in hot storage, delete the key when a user is deleted.
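
A minimal sketch of that approach (sometimes called crypto-shredding) using the Python cryptography library. The key store is just a dict here; in practice it would be a small table in regular hot storage. All names are made up:

    from cryptography.fernet import Fernet

    keys = {}   # hot-storage key store: user id -> key; the only thing that ever needs deleting

    def backup_user_data(user_id, data):
        """Encrypt a user's data under their own key; the ciphertext can sit in cold storage forever."""
        key = keys.get(user_id)
        if key is None:
            key = keys[user_id] = Fernet.generate_key()
        return Fernet(key).encrypt(data)

    def restore_user_data(user_id, ciphertext):
        return Fernet(keys[user_id]).decrypt(ciphertext)

    def forget_user(user_id):
        """GDPR erasure: dropping the key makes every archived copy unreadable,
        without touching the backup sets themselves."""
        keys.pop(user_id, None)

    blob = backup_user_data("user-42", b"personal data")   # ship blob off to the archive tier
    forget_user("user-42")                                 # archived copies are now effectively shredded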


Won't that make the backups useless in case of a data loss (i.e. always)?


You don't keep a single copy of each key, but store enough redundant copies to get the proper number of nines. Preferably that's redundant geographically, in terms of storage technology, and in write frequency.

The important part is just that the keys don't end up in long term cold storage. Either it's only retained for a short period (e.g. tape backups that get rotated after two weeks), or it supports live deletion.


Encrypt the backups and store the encryption key in a normal non-archival bucket.


What are the transfer costs for storage and retrieval?


I was also looking for that; the only piece of info I found was:

Unlike tape and other glacially slow equivalents, we have taken an approach that eliminates the need for a separate retrieval process and provides immediate, low-latency access to your content. Access and management are performed via the same consistent set of APIs used by our other storage classes, with full integration into object lifecycle management so that you can tier cold objects down to optimize your total cost of ownership.


It seems to have the same pricing as the other storage classes: no fees for accessing the files within the same region, and the typical bandwidth fees if the backups are downloaded to somewhere else.


Intuitively there is some cost for retrieval. Otherwise you'd just use the cold storage to store everything.


Like every cloud storage service, file access costs money, as does the operation to access it. But it's so minimal it's basically nonexistent for a backup solution.


Nearline and Coldline have a per-byte retrieval cost in addition to the increased-but-still-low cost per operation.

https://cloud.google.com/storage/pricing


[flagged]


They downvoted you because it would be really easy for you to look the prices up by yourself. Will take only a few seconds:

* Go to https://cloud.google.com/compute/

* Scroll down to the pricing information

* Click on the link for the price list


What does the age of someone have to do with them downvoting you? You asked a question that you could have easily answered yourself.


Google has burned people so many times by shuttering products with little to no warning that I'd be hesitant to trust them with my long-term data storage.


Eh... for consumer stuff, sure, and perhaps even for new/experimental GCP features. But this is storage, a core function, on GCP, an enterprise service with actual contracts and SLAs attached.


Just don't think of it as something you'll ever want to restore unless the building burns down and you've lost everything.

Glacier's restore costs included a lot of fees in my one experience. We could have bought several RAID units for the price of a fast restore. If you asked for it back over a long period of time, the price dropped dramatically.



