Show HN: Low-cost backup to S3 Glacier Deep Archive (github.com/mrichtarsky)
120 points by mrich on Sept 16, 2022 | 140 comments
Hi,

most people (hopefully) have local backups. However, when that backup fails, it is good to have a backup stored somewhere off-site. In the old days you would ship physical drives/tapes, which is cumbersome, costly, and slow. With fast upload speeds, it is now possible to upload your data to the cloud. I have found S3 Glacier Deep Archive to be a great solution for this:

- It is very cheap ($1/TB/month for US regions)
- It is very reliable (99.999999999% data durability, data spread over 3 Availability Zones)

However, usability out of the box is not great: I'm not aware of any automated backup solution for Deep Archive. This free project provides one.

Currently, ZFS is required, but that might change. Please try it out and provide feedback!




Note that, in the event of retrieval, it's not just per GB, it's also per request.

Synology has offered Amazon Glacier and S3 as a destination option with Hyper Backup for years as part of their NAS offerings. Given its automatic archive feature for moving an existing store to Glacier Deep Archive, and budget permitting, I'd recommend a NAS over this for three reasons:

- Initial setup costs aside, the power draw of a two-bay unit like the DS218 (15W at load) would be ~$16/year at peak usage, assuming a cost of $0.12/kWh (see the quick check at the end of this comment)

- Uploading/syncing your local files to your NAS should be considerably faster, technically 'free', and can be done more frequently as you desire; should you need them, it would also be 'free' to retrieve them locally barring a catastrophic event

- The remote push of the NAS contents to S3/Glacier storage can be done independently of your PC's state (and, to save money, less frequently if you wish), which as you point out could take days; additionally, you can save money by reducing the number of requests via automatic archiving/compression

Given how unlikely it is for you to retrieve data from Glacier Deep Archive with such a setup, I highly recommend it. You can still rest knowing your data is offsite.
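A quick sanity check of that power-cost estimate (a rough sketch; the 15 W draw and $0.12/kWh are the figures assumed above):

    watts = 15                                    # DS218 at load, per the list above
    hours_per_year = 24 * 365                     # 8,760 h
    kwh_per_year = watts * hours_per_year / 1000  # 131.4 kWh
    cost_per_year = kwh_per_year * 0.12           # ~$15.77, i.e. roughly $16/year
    print(f"{kwh_per_year:.1f} kWh/year -> ${cost_per_year:.2f}/year")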


Backups should always be tested at least once to ensure that they do in fact work. Nothing worse than losing data and discovering that you've been paying a service for naught all those years. This could get prohibitively expensive if you can't replicate your setup in the cloud to test backups within the AWS network.

Or you could just blindly trust that it'll work. Kind of like Schrödinger's backup.


You could make a backup of one text file. If you can restore that, then you can likely restore something 1e9 times larger.


Could be, but it could also be a fluke. Perhaps everything after the first file-index file is corrupted and you only restored things from the first one, or perhaps the entire diff-index becomes so large that it's not actually recoverable once you grow it to the size you want.

There are a myriad of easy-to-miss failure scenarios that don't appear in small-scale tests. The only way to ensure you can restore your backups is to restore your backups.


> Note that, in the event of retrieval, it's not just per GB, it's also per request.

This should fall within the free tier in most cases, since the script creates archives of the specified size. The free tier includes a few thousand requests.


> Synology has offered Amazon Glacier and S3 as a destination option

For the record, QNAP does too, I've got an automated job that backs up any new files it sees every Sunday morning.


What does 99.999999999% durability mean exactly? Does it mean a probability of 0.000000001% (1 in 100_000_000_000) that your bits will randomly disappear? Is that yearly?

One interpretation is that about 1 bit per 100 GB will randomly flip each year. That or S3 Glacier is expecting to hit a catastrophic event every 100 billion years (which doesn't seem nearly frequent enough).


Also note how Amazon doesn't claim that the service has 11 nines of data durability, but that it is designed for 11 nines of durability. That's not quite the same.

Backblaze apparently gets the same number by looking at the failure rate of their drives, and how long it takes to recover from a drive failure (reaching the same redundancy as before the failure). By being able to recover from three drive failures before the first drive is rebuilt, they get 11 nines of durability [1]. Or in other words: 0.000000001% of drive failures are non-recoverable and lose data. I assume AWS does basically the same calculation, adjusted for whatever technology they use.

1: https://www.theregister.com/2018/07/19/data_durability_state...


This explains it pretty well:

https://blog.synology.com/data-durability

> Data durability: the ability to keep the stored data consistent, intact without the influence of bit rot, drive failures, or any form of corruption. 99.999999999% (11 nines) durability means that if you store 10 million objects, then you expect to lose an object of your data every 10,000 years.
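A quick check of that arithmetic (a minimal sketch; the 10 million objects and the 11-nines figure are taken from the quote above):

    objects = 10_000_000
    annual_loss_probability = 1e-11             # 100% - 99.999999999%
    expected_losses_per_year = objects * annual_loss_probability   # 0.0001 objects/year
    years_per_lost_object = 1 / expected_losses_per_year           # ~10,000 years
    print(f"~1 object lost every {years_per_lost_object:,.0f} years")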


Doesn't AWS only spread your data over three availability zones, which are three data centers in the same city, often located in close proximity to each other? I'd put the odds of Amsterdam suffering a catastrophic flood or nuclear strike that wipes out all three data centers at once at more than once every 10,000 years.


I mean, we're already pretty certain that their current design isn't enough to survive the sun exploding, so that puts a hard limit of all data disappearing once every 10 billion years or so.


Think we'll have an AWS region in Alpha Centauri by then...

Edit: and it'll still break when us-east-1 is destroyed by the sun exploding.


"an object"

Shouldn't that be multiplied by the number of bytes of this object? I still don't quite understand the math apparently because then we'd all be losing 10KB+ files yearly.


I don’t think so. The question is: for any given file, what is the likelihood that if I restore it, there will be some corruption? (Total loss, bitrot, anything)

For the purposes of this calculation, you would probably use the average size of a stored file, determine the number of chunks it was split into, and then the likelihood of losing/corrupting a chunk.

By looking at the durability of objects, you (a) reinforce the object-store nature of S3, and (b) avoid reporting “durability per kB”, which isn't linear and is more difficult to interpret.

Honestly, the durability number is so high, that adding the bytes as a unit would just make it difficult to reason about. And S3 doesn’t store “bytes” — it stores objects (made of bytes). So, that’s the metric they chose to focus on.


If we assume for the sake of argument a simple RAID 6 setup, to lose, say, a block of data you'd need to:

* have a bad block on one drive
* have a bad block in *EXACTLY THE SAME LOCATION* on the second drive
* have a bad block in *EXACTLY THE SAME LOCATION* on the third drive
* all of that happening within one scrub period (which is usually once a month)

And that's just a simple RAID6 setup without external redundancy.

As you can see, it would be far easier to lose data due to correlated drive failures than from random block corruption.


Not sure exactly. But Backblaze has been pretty transparent on their approach.

Backblaze takes user data, splits it into 17 pieces, adds 3 pieces of redundancy, and splits it across 20 racks. Backblaze has a tiered response system; my memory is something along the lines of 1 of 20 disks dying = scheduled for replacement the next business day. The 2nd disk triggers a quicker/higher-priority response. The 3rd disk triggers pages and all hands on deck.

Given that disks have an annual failure rate on the order of 1%, and losing the first disk has a chance of 1% * 1/365, a second disk within 4 hours is 1% * 1/(365*6), and a 3rd disk within the hour is 1% * 1/(365*24), you end up with lots of zeros:

0.01 * (1/365) * 0.01 * 1/(365*6) * 0.01 * 1/(365*24) ≈ 0.00000000000000014281
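Reproducing that arithmetic (a rough sketch; the 1% annual failure rate and the response windows are the assumptions from the comment above):

    p_annual = 0.01                   # ~1% annual drive failure rate
    p_first = p_annual / 365          # first failure on a given day
    p_second = p_annual / (365 * 6)   # second failure within the same 4-hour window
    p_third = p_annual / (365 * 24)   # third failure within the same hour
    print(f"{p_first * p_second * p_third:.4e}")   # ~1.4281e-16, matching the figure above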

That's just to remove the redundancy, you still need to lose another disk before data is lost.

Sure, there are other failure modes, including those that could take out an entire data center. I did find a statement saying that data stored in the US-West region would be stored in both Sacramento and Phoenix, but I'm not sure if that means each B2 object uploaded would be in both.


I use rclone and Backblaze B2 for this. B2 used to be cheaper, though I'm not sure if that's still true with the newer Deep Archive, but it is much less fiddly and has no crazy fees at restore time.

Rclone is also multi-threaded, so it goes much faster than rsync.


Here's my "me too" — I've been happily using rclone for things like photo archives (together with my small consistency checker to check file hashes for corruption https://github.com/jwr/ccheck). I also use Arq Backup with B2 as the destination. This gives me very reasonable storage costs and backups I can access and test regularly.


"me too" - rclone with Azure Archive tier for last-resort backups. I like that rclone doesn't tie me to one storage provider, and I'm not sure why anyone would use the linked project over rclone.


As mentioned in another reply below, I found these issues with rclone, perhaps you can tell if I'm missing something:

- It will not create archives by default, so it potentially uploads a lot of files (just like rsync). There are a few workarounds in the documentation to mitigate the cost.

- It cannot do restore of Deep Archive data

Edit: Can do restore, but it's a manual step.


Sorry, I don't know about that, it just does what I want out of the box.


With Backblaze B2, neither of those is an issue. So not sure.


S3 Deep Archive is two times cheaper than B2. BUT you are not supposed to just rclone to S3 Deep Archive, because it charges you based on the number of files you upload.

The upload fee (PUT request fee) is $0.065 per 1,000 files, so it will charge you a lot when you have millions of files.
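As a rough illustration of how that adds up (a sketch using the per-request fee quoted above):

    put_fee_per_1000 = 0.065          # PUT request fee quoted above, per 1,000 files
    small_files = 1_000_000
    cost = small_files / 1000 * put_fee_per_1000   # $65 in PUT requests alone
    print(f"1M individual files: ${cost:.2f} just to upload them")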


Yep. Best to tar backup chunks as a whole before sending to DA.


B2 is $5 per TB per month, versus S3 Deep Archive ~$1 per TB per month.

So if hot access isn't a concern then S3 can be 1/5th of the cost.


I've been doing that via TrueNAS SCALE for the past year. The upload job hangs, then fails. There must be a bad permission or filename somewhere, but I don't know what the right tool is for debugging this, so I've just put it off. If anything's missing it can't be large (judging from the space taken on B2).


Since I'm not that familiar with the pricing model behind this, did I understand correctly that it costs roughly $1/TB/month to store and roughly $95/TB to restore? The price seems steep at first, but compared to regular backup services, where the cost usually adds up to roughly $100 per year, it starts to make sense. I have backups, but I don't think I would ever need them more than once a year; even that would be alarmingly often.


The price is not steep at all. This is a last resort solution, when all your other backups have failed and you need to retrieve your data. The idea is that you never need to restore it.


Glacier's primary business use is not as a backup solution but as a regulatory compliance tool. The idea is that you can guarantee regulatory bodies that whatever historical data is required by law to be kept for x years has been preserved, based on AWS' own claims of persistence.


I can totally believe that, but I meant in the context of this post and what the OP is presenting.


The backup and storage service is cheap; general outgoing bandwidth is very non-trivial, as the same egress pricing applies to every other service in AWS. It's designed to keep your data in and heavily discourage multi-cloud.


With a little setup (no more than described here), rclone just... does this out of the box.

Specifically I have an S3 remote configured to use the Deep Archive tier. On top of that I have an encryption remote (pointing to the S3 remote). Then I just rclone my pool to this remote, and all my crap is shipped off to Ireland.

Like in the link, I expect never to need it; restore is so expensive that it's a "house burns down" insurance only.


I use rsync.net as a secondary backup and it's especially handy because you can easily configure it to keep N snapshots for every X days/weeks/months/quarters/years


I found two problems with rclone, would be interested to hear if I misunderstood anything or if you have a workaround:

- It will not create archives by default, so it potentially uploads a lot of files (just like rsync). There are a few workarounds in the documentation to mitigate the cost.

- It cannot do restore of Deep Archive data


Edit: Can do restore, but it's a manual step.


Thanks, sounds interesting, will take a look. Do they store files 1:1 or do they create tar archives? Single files are quite inefficient on Deep Glacier.


I've moved all my stuff off of glacier because of their ridiculous pricing model, and user-hostile metadata handling.

I.e. you have to maintain your own index of files, where they could have just done this for you.

The pricing model for downloads makes it too easy to shoot yourself in the foot. I'd rather pay a tiny bit more to not have bankruptcy traps built into the product. So that's what I do now.


What I would really like to have with AWS is some kind of prepaid scheme: first to avoid surprise costs in the early stages, second to top up enough to keep resources running in case I fall into a coma or my digital identity gets disrupted by some glitch. I accept that if I don't top up in time, then the resources are gone.


> AWS is some kind of prepaid scheme

I heard from a friend who chose AWS to handle backups specifically because he prepaid for them, so that is presumably already possible.


I think maybe you're referring to the original Glacier vault model? Nowadays you can put objects in an S3 bucket (with your own object name and metadata) and set the storage class to Glacier Flexible Retrieval or Glacier Deep Archive. Then you can use the S3 APIs to get object lists etc. Also, they got rid of the insane "peak hourly request fee" stuff when downloading. https://www.arqbackup.com/blog/amazon-glacier-pricing-change...


Glacier is a different product than S3 Glacier. Plain Glacier is indeed hard to use, but S3's Glacier storage tier makes it all pretty easy to use.


Can you share alternatives you found, if any?


> 3 or more AWS Availability Zones, providing 99.999999999% data durability

I think you need to multiply that by the durability of amazon as a company...

I suspect that in any given year there is perhaps a 1% chance that they shut down AWS with no chance to retrieve data (due to world war, civil war in the USA, a change in laws, banning you as a customer, a change in company policy, bankruptcy, etc.).


Usually when that happens you are given a warning and a window (often of many months) when you can still access your data.

And, most likely you still have your original data. P(Data loss) = P(Bankruptcy happens dec 2022) x P(drive failure happens dec 2022).

The likelihood of the company going down one day, without notice, and, coincidentally, your regular data store getting trashed the next day is extraordinarily low. Absent some kind of end of the world scenario the two events are, for all intents and purposes, independent and low probability.

If you're planning for that scenario you've probably slipped over the line from sensible backupper to prepper.


In that scenario, your local copy is the backup!


> in any given year there is perhaps a 1% chance that they shut down AWS with no chance to retrieve data

That would translate to 64% chance in 100 years. Seems plausible :)


I think it has already happened to customers based in Iran last time the USA changed their stance? I don't think those customers got any notice to download data.


The distribution of probabilities over a 100 year period won't be uniform


"Banning you as a customer" is like a billion times more likely than any of those other things. That's the only real thing to be concerned about in that list. AWS is too big to fail; if it came to it, I bet the US Government would bail them out. But they can ban you at any time; that's a serious concern of mine and the main reason why I mirror all of my AWS data to Backblaze B2.


This service has already existed for a very long time. I have been using it for many years with https://www.arqbackup.com as a fallback for my Time Machine and Backblaze backups. Google also offers a very similar service to Glacier, called Coldline: https://cloud.google.com/storage/docs/storage-classes#coldli...


Arq does not seem to be in the same ballpark as Deep Archive with regard to pricing. Also, I needed a Linux solution; they unfortunately support Win/Mac only (same as Backblaze).


I'm confused by this comment. Arq can store your backups in your own AWS account, using Glacier Deep Archive storage class. It's your own AWS account, so you pay exactly the Deep Archive price.


I understand; I didn't see this mentioned on the pricing page directly. How do they store the data in Deep Glacier (single files or archives? the former would be very expensive)? Do they have restore fully integrated (since this is a two-step process that takes 12+ hours, I could imagine it does not fit the normal flow)?


Arq stores the data in content-addressable format. Arq's data format is documented here: https://www.arqbackup.com/documentation/arq7/English.lproj/d...

Arq manages the restore process, telling AWS to make Glacier/Deep Archive objects "downloadable", waiting for them to become downloadable, and then downloading the data.


Arq is a GUI client that stores your backups in the location you set it up for. See screenshot in the "Back up to your own cloud account." section on https://www.arqbackup.com.


This often disregards the cost of retrieving said data, which is $90/TB for outbound network traffic, on top of the cost of making the backups available.


This doesn’t disregard that. It lays it out with some nice numbers so that it’s clear. It’s much appreciated.

I’m very happy to pay USD 200 if I ever need to retrieve a last chance backup. If I just want the data in no rush I can trickle it back over 20 months for free.


Outbound traffic always costs the same, regardless of the retrieval speed you request via the API. If you only have a couple of TBs worth of high-value data, this may be acceptable to you. It's far from being a universal truth though.


But you get 100 GiB/month free download :)


Downloading backups for a decade is untenable


100 GiB per month for a decade... how many terabytes of irreplaceable pictures have you got?

But, yeah, making use of a free-service loophole, it taking ten months for a terabyte, and presuming you're using zero other Amazon services that count towards your free allowance, I agree with you it's a stretch. But if you're really cheap or hate Amazon with a passion (but not enough not to host with them) it's not that far-fetched.


I use tarsnap for off-site backups:

https://www.tarsnap.com/

It's probably not as cheap as Glacier, but it's cheap enough for my needs, secure and encrypted, and was very easy to set up.


I use rsync.net's borg offering, which is similar (encrypted + deduplicated) but 17x cheaper (1.5 cents vs. 25 cents per GB-month, and also zero bandwidth fees). I think it's even less with the HN discount; I'm too lazy to check.

https://www.rsync.net/products/borg.html

To be completely fair: since Tarsnap doesn't have a minimum order size, it's still a better value in the 0-6 GiB range.


I pay $150/year for BorgBase backup and store 1.66TB, mostly because I really like Borg backup and am comfortable with it. That's my last-tier backup. I highly recommend BorgBase, they're good folks.

I also pay $120/year for Google drive which is my "online" backup, then store the files locally of course.

Seems to work OK for me, and I'm insulated against the problem of: "google's algorithm decided I was a Bad Person and terminated my accounts".


Edit: was wrong.

Original:

0-40 GiB, due to the minimum and assuming value means "$ to store all my X GiB" as opposed to "$/GiB"


I can't follow your math. If I had 40 GiB, it would cost $10.00/month to store them on Tarsnap (40 GiB * $0.25/GiB-month), or $1.50/month to store them on rsync.net/borg (the 100 GiB minimum order).

The breakeven point is lower, at 6 GiB.


Two pricing pages.

You looked at https://www.rsync.net/products/borg.html

I looked at https://www.rsync.net/pricing.html, because I clicked pricing and didn't scroll down.

You are correct.


It’s definitely not as cheap: $1 buys you 4GB/month.


This recalled a horror story about huge charges when retrieving the data. I searched for the link, and it seems like the pricing has changed since that blog entry [1].

Still a good idea to check for the extra charges when reading.

[1] https://medium.com/@karppinen/how-i-ended-up-paying-150-for-...


This is mentioned in the link, about $100 to get your 1TB out.

I can see home backup use cases where you might be happy to get it out at 100 GB/month over time. For example, a home photo/video library where you can view arbitrary photos and slowly pull out all of them.


I think Amazon Glacier should be considered the third or fourth backup: the first being local, the second cloud or distributed, and then the third or fourth as a last-resort off-site copy, used when everything else has catastrophically failed.

At which point $100 per TB isn't that bad.




Just a single drive can be more than 16 TB nowadays; having just a couple of those makes $90/TB seem quite high.


That's $90 for redundant tape storage, stashed away securely in some place far away from your home. A 16 TB drive is cheap, but you need a few of them to be safe against file corruption and bit rot, and you need to stash them somewhere a house fire, robbery, or natural disaster can't reach them. Also, the storage itself is quite cheap; just the retrieval is expensive. But, as other people said, if you need to get this data, those few dollars won't be your problem.


Disasters often bring financial difficulty, in which the cost of recovery is very much a problem. It's hard to justify for any individual when much cheaper services that are easier to use and understand are available. It's great if your enterprise is looking for a compliance checkbox to tick; Glacier can save a lot of time in that case.


Yeah, it's putting low cost up front, but if you need the retrieval NOW, you're going to pay it. On any cloud provider, you should absolutely be reading their pricing carefully so you can account for the cost of these situations.


As mentioned, AWS has simplified the pricing model, so getting it all at once isn't hugely expensive anymore.


Outbound traffic costs don't change at all regardless of when you need your data; it will always be at least $90/TB.


> 99.999999999% data durability, data spread over 3 Availability Zones

I'd love to see what source(s) this claim has in practice. How did you arrive at 9-9s? Shouldn't it be spread over 3 Regions rather than Availability Zones, as otherwise it could just end up in the same geographical region while giving the impression it's spread across many.


It's not my claim, here's the link to the AWS page:

https://aws.amazon.com/s3/storage-classes/#____

You could upload to multiple regions just to be sure.


Also there is afaik no SLA for this! Just marketing.


This is all over AWS documentation; you get the durability via erasure coding / replication over multiple AZs in one region. Spreading over AZs within a region protects against one AZ going down. You don't run a service like S3 across AZs in different regions because of the speed of light. If you are paranoid about a region going down, you use replication.


Isn't the chance of nuclear armageddon something like 30% this century? That should affect the probabilities.


And this is how Flash-drive-on-cockroach As A Service was born



Can't you also just use existing tools to back up to S3 and then move it to Deep Archive?


You can, though this will incur some costs related to temporarily storing your data in S3, as it either won't be immediately transferred when using lifecycle policies, or will incur extra charges if you force a storage class change based on file sizes and file count.

Note that there is also a minimum file size of 128 KB per object in Glacier, as well as 32 KB of extra metadata, and everything smaller will be counted as the minimum size. Some mitigate this by bundling files into larger chunks, at the cost of retrievability and now having to keep a map of file-bundle associations.
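A small sketch of what that minimum means for tiny objects, using the 128 KB and 32 KB figures from the comment above:

    MIN_BILLABLE_KB = 128   # minimum billable object size cited above
    METADATA_KB = 32        # extra metadata overhead cited above

    def billed_kb(actual_kb: float) -> float:
        # Approximate billable size of one object under the figures above.
        return max(actual_kb, MIN_BILLABLE_KB) + METADATA_KB

    for size in (1, 16, 128, 1024):
        print(f"{size:>5} KB file -> billed as ~{billed_kb(size):.0f} KB")
    # A 1 KB file is billed as ~160 KB (160x overhead); bundling files into
    # large archives avoids paying that per-file minimum.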


I've created similar functionality just using a simple bash script that will send the latest version of ZFS datasets to S3/Glacier, including dealing with incremental changes. I have mentioned this previously on HN and got a few useful changes submitted for it, especially making it more platform agnostic.

I have some open tickets asking about (script based) restoring. I haven't tried this yet as this has been a backup of last resort for me, but hopefully posting this again will nudge me into looking at that.

https://github.com/agurk/zfs-to-aws/


> I'm not aware of any automated backup solution for Deep Archive.

What about the well-known rclone? https://rclone.org/s3/


I would also like to mention restic, for which rclone can act as a backend, while you get a very nice frontend for backups.


I use Arq with AWS Deep Archive as my off-site. Seems to work well, though admittedly I haven't tested the recovery/retrieval yet.


How does Arq store your files on Deep Archive? Does it create archives or create the same structure you have locally?


You provide an encryption key and tell it which folders you want to encrypt from your local machine. Then the app provides a GUI which shows the folder structure that'd been mirrored to AWS and you can selectively restore what you need.


I checked it; it's quite good. Also, during restore it waits for the files to become available for download. But personally I don't like the fact that files are packed into some blob storage, which is not as easily accessible as a tar archive.


That's good to know. Previously I'd used Amazon Drive, Google Drive and alternatives where I had restored some files. Always worked as desired. I was already switching from Google Workspaces to cut costs, and priced out how much I'd save by using Glacier for my off-site backup, and it was a no-brainer to make the switch. This guy had a decent write-up on the pricing:

https://clete2.com/posts/backup-glacier-deep-archive/


I'm curious, has anyone really experienced any data loss on a public storage service like S3? I'm not sure if the count of 9s actually matters…


We experienced a few dozen unrecoverable files on Rackspace Cloud Files (which at the time used OpenStack under the hood). This was out of about 10M objects that totaled about 2 TB, but they were less than 3 years old. The API call just failed with a 404 even though the files were listed in the list API call.

The corruption was uncovered when we started migrating to another cloud platform; we had to restore from local copies.


https://github.com/andaag/zfs-to-glacier

I built something similar a while back that I've been using for years now.

Something worth noting: there is a minimum cost per file. If you have tons of tiny KB-sized files (incremental snapshots...), it's drastically cheaper to fall back to S3 for them.


You still need to check your backups every once in a while. Glacier is priced such that most people don’t check their backups. This could be worse than no backup.

Also, one may frequently add and prune snapshots. Costs of this should be considered too. You may use hot storage, but pruning usually removes old data which is in cold storage.

Does anyone here check glacier backups?


Have you considered using plain S3 buckets with Intelligent-Tiering AND the two opt-in archive access tiers? You can use the normal S3 APIs to upload; then, after 180 days or so, your objects transition to Glacier Deep Archive. You do pay a penny per 1,000 objects, but the benefit here is using S3 like normal. You still have to wait hours for restores.


But you would pay quite a lot for the time your data is still in S3 blob, no?


For onsite backups, is it a valid option to buy a spindle of blu-ray discs and swap away until you have another copy of everything with enough par files included to account for a few years of bit rot?

Or is copying everything to a new, larger hard drive every year and keeping a few years of drives still the best choice?

Edit: for personal use, for sure! I think BD-R goes up to 100GB.


With generally available rewritable Blu-ray capacities, archiving even a couple of terabytes this way would be quite burdensome. I would recommend multiple online hard drives, preferably far apart from each other, with some append-only mode to prevent accidental deletion. Or a cheap online-backup service, or both.


That's essentially a more expensive (manpower-wise) and less reliable alternative to tapes.

Keep in mind that what might be considered 'a lot of data' varies significantly. A medium-sized commercial company here in western EU is going to require about 0.5 PB of backup space for a single data rotation. In other areas you might get away with storing 100 TB or even less.


Hetzner storage boxes are also worth a look due to their colourful range of connection options:

Borg, restic, rclone, SFTP, etc.


Some questions that weren't answered in the README:

- What type of encryption is used exactly? Is it a simple encfs, which leaks some meta information, or is a container created, or something else?
- Is it a full snapshot backup or does it work incrementally?


It's aes256 using openssl:

https://github.com/mrichtarsky/glacier_deep_archive_backup/b...

Does that leak information you would be concerned about?

It's always a full backup.


Nope. There are some concerns about using openssl for file encryption, and it is generally good to be very specific about the encryption parameters instead of using the defaults. I don't necessarily share all the concerns, but it might be worth knowing them, e.g. https://security.stackexchange.com/a/182281


Thanks for the pointer. gpg is the better option then, I've switched to it.
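For reference, a minimal sketch of the kind of explicit invocation the linked answer argues for (not the project's actual code); the file names and passphrase location are hypothetical, and GnuPG is assumed to be installed:

    import subprocess

    ARCHIVE = "backup_0001.tar.zst"               # hypothetical archive name
    PASSPHRASE_FILE = "/root/.backup_passphrase"  # hypothetical, root-only key file

    # Symmetric encryption with an explicit cipher instead of relying on defaults.
    subprocess.run(
        [
            "gpg", "--batch", "--yes", "--symmetric",
            "--cipher-algo", "AES256",
            "--pinentry-mode", "loopback",
            "--passphrase-file", PASSPHRASE_FILE,
            "--output", ARCHIVE + ".gpg",
            ARCHIVE,
        ],
        check=True,
    )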


Still a thumbs up, because all the backup solutions I and a lot of my friends are aware of don't cover most use cases, and I am happy to see people trying to fill some of those gaps.


Aside from the general concept issues that others are addressing (e.g. "what does 99.999999999% durability mean exactly"), there's also code smell.

I had a quick glance through and couldn't help noticing the stench of assumptions and poor (or non-existent) exception handling.


Definitely not enterprise-ready code but I'd say it's solid. Please point out some specifics.


Is there any good OSS solution that supports multiple servers and modern storage targets?

There is Bareos/Bacula, but that just pretends everything is a tape and generally works badly/quirkily because of that.


rclone seems to fit the bill:

https://rclone.org/


This one looks strictly like "a server backing up its local things to the cloud". Like, I hadn't seen it before and it looks like a great solution for my NAS, but not exactly something to deploy on servers that don't even have access to the internet in the first place and are currently backed up via a Bareos agent.

I guess there isn't much OSS there, because that's pretty much strictly enterprise space.


Try Wasabi! It's amazingly affordable at $5.99 per TB/month, WITHOUT fees for egress or API requests. I'm storing all my backups there, and they are not single-DC anymore.


Cheap archiving, but very expensive to get data out of AWS. omg


This is part of why my main backups aren't on something like that plan. I like to checksum them fully occasionally, no matter what assurances the provider gives about them protecting from bit-rot, and that could get quite expensive.


How would you preserve an encryption key for such a backup outside of the digital world, for personal purposes, and also allow it to be unlocked after your death?


There is no single good answer for this. Data generally needs to be constantly taken care of to survive. This will necessarily require some amount of trust, since you can't be in control when you're dead, and once the timer unlocks, just about anyone with physical access to the key can open containers with it.


Hmm, I just ship the files directly to Glacier using the aws cli -- aws s3 sync /foo/bar s3://<bucket>/bar/


Wasn't aware of this :) The problem is that this will sync files as-is without creating archives. Not sure how it's implemented internally; if they're using a single request per file, this would get expensive quickly. In any case it will be much more expensive, since each file costs a small extra amount for storage. Also curious whether they can handle auto-restore during a cloud -> local sync. Will give it a try.


Checked it; it uploads single files. This is very expensive. Also, they only have server-side encryption. Also, it cannot handle restore :(


Yeah, I wasn't doing an entire disk backup or anything - some hundreds or thousands of largish files.


What is the maximum size of each .tar.zstd.ssl?

We have storage with 40 TB and 100 million files; what is the expected number of archive files?


The S3 max file size is 5 TB, so with that you'd have to have at least 8 archive files. You can use byte ranges to fetch only a single file out of an archive if you have the means to determine that range.
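For illustration, a hedged sketch of such a ranged read with boto3; the bucket, key, and byte offsets are hypothetical, and for Deep Archive the object must already have been restored before GetObject succeeds:

    import boto3

    s3 = boto3.client("s3")

    # Fetch one member's bytes out of a large tar archive, assuming its offset
    # and length are known from a locally kept index.
    resp = s3.get_object(
        Bucket="my-backup-bucket",        # hypothetical bucket
        Key="backups/archive_0001.tar",   # hypothetical key
        Range="bytes=1048576-2097151",    # hypothetical 1 MiB slice
    )
    chunk = resp["Body"].read()
    print(len(chunk))                     # 1048576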


Hmm, conflicting info?

https://aws.amazon.com/s3/faqs/

> Individual Amazon S3 objects can range in size from a minimum of 0 bytes to a maximum of 5 TB.

https://docs.aws.amazon.com/amazonglacier/latest/dev/uploadi...

> Using the multipart upload API, you can upload large archives, up to about 40,000 GB (10,000 * 4 GB).

Since the archive must be restored to S3 before download, I wonder what will happen if you upload and restore a 5 TB + 1 byte archive?


Uploads are not free as described in the project, I think. Unless I'm misreading, AWS's cost page showed about 5¢ per 1,000 files.


You are right, I will mention this in the README. Note that the script creates archives for this reason; it does not store single files. E.g., in my local setup I have an archive size of 50 GiB, so about 20 files per terabyte. This makes the API costs negligible.
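A quick check of those numbers (a sketch; the 50 GiB archive size is from this comment, and the per-1,000-request fee is the one quoted elsewhere in the thread):

    archive_gib = 50
    archives_per_tib = 1024 / archive_gib   # ~20 archives per TiB
    put_fee_per_1000 = 0.065                # request fee quoted elsewhere in the thread
    cost_per_tib = archives_per_tib / 1000 * put_fee_per_1000
    print(f"{archives_per_tib:.0f} archives/TiB -> ${cost_per_tib:.4f} in PUT requests per TiB")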


Why not mention this straight in your post:

"Restore and download is quite costly:

Restore from S3 tape to S3 blob: $0.0025/GiB ($2.56/TiB) for Bulk within 48 hours $0.02/GiB ($20.48/TiB) for Standard within 12 hours Download: The first 100 GiB/month are free, then 10 TiB/Month for $0.09 per GiB ($92.16/TiB) and discounts for more."

TL;DR: if I read it correctly, $2.56 + $92.16 to get your 1 TB back home.
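Checking that TL;DR (a sketch using the rates quoted above; it ignores the 100 GiB/month free download allowance):

    tib_in_gib = 1024
    bulk_restore = tib_in_gib * 0.0025   # $2.56, Bulk restore from tape to blob
    download = tib_in_gib * 0.09         # $92.16, egress at $0.09/GiB
    print(f"~${bulk_restore + download:.2f} to pull 1 TiB back home")   # ~$94.72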

Not that bad, but I feel like buying a 1 TB drive for ~$50 every half year and just storing it somewhere outside your home would be a cheaper option. But it depends how often you need to perform backups.


It is not that bad a price when your home and the place with your offline backup have burned down and Google or whoever has locked your account.


Or if you're willing to dribble the data back home at a rate of 100GiB/month then it's $2.56/TiB.


Do you really want to spend 10 months retrieving that data? I don't think there's a single real-life scenario where you do need the data, just not this year. So $2.56/TiB is unattainable.


Also, the issue with waiting 10 months is that you now only have one copy, the one on S3, and you may lose that data. Not because of the 99.9-recurring % guarantee, of course, but because of another fubar issue, like someone hacking into your account, or some fubar with the backup script, and now you have no backup of the backup.

If you need the last-resort backup, then you would probably want to revert to normal operations ASAP, where you have the original and the first-resort backup back in place!


Glacier is nice, but due to the cost of data retrieval, people should use it as a restore of last resort.


For deep archiving, the major question for me is whether the tool does client-side encryption easily. I never understood server-side encryption: if you put your keys there, the server now has both the key and the content, and of course they can in theory decrypt things at will.


The question is what the costs are when you need to retrieve the data.

The prices might be higher next year.


Put all the data you want in, pay ungodly amounts to get it out tho.


Yeah, it’s definitely a backup of last resort for that reason. A bit like the Roach Motel: data checks in, but it doesn’t check out.


Low cost until you need a (partial) restore...



