Ask HN: What's your approach to personal data archiving and backups?
32 points by monroewalker 9 days ago | 31 comments
I'm currently just relying on cloud services to backup data but I'm thinking it's time to start using some offline solution both to avoid the risk of account lockout and to avoid having to trust third parties with sensitive data.

I often see comments on HN which favor self-hosting and personal data management over use of cloud services, so I'm curious what approach people here take.

I've only just started doing the research for this, but my plan so far is just to buy a couple of high-capacity hard drives that would be mirrors of each other. Occasionally I'd copy files over from my computer to one of the drives, and occasionally I'd mirror the data from one drive to the other. I'm also wondering if I should just re-use some existing 2.5" drives I have from old laptops, or if it's more prudent to purchase a new drive that might be manufactured with long-term durability / stability in mind.
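For what it's worth, a minimal sketch of that copy-then-mirror plan with rsync (the mount points are placeholders; run the mirror step with --dry-run first, since --delete makes drive B an exact copy of drive A, deletions included):

    # copy working data onto the primary archive drive
    rsync -a --progress ~/Documents /mnt/archive-a/

    # then mirror drive A onto drive B
    rsync -a --delete /mnt/archive-a/ /mnt/archive-b/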






A good approach is the 3-2-1 backup strategy: keep 3 copies of your data (1 working set and 2 backups) on 2 different media (disk, DVD, NAS, cloud…), with 1 copy off-site for disaster recovery (cloud, external disk at another location…). Depending on the type of data, how paranoid you are, and the available budget, you can add variations: a fourth set of data, different filesystems (NTFS, exFAT, btrfs, ZFS, ext4…), different disk manufacturers, different batches of hard disks, different NAS enclosures, different methods (soft RAID, hardware RAID, error correction…), encryption, storage at a friend's house, and so on :-) If you can afford it, buy a dual-disk NAS, and also put your spare 2.5" disks in external USB cases and synchronize your data every once in a while.

I have given out this link countless times:

https://www.jwz.org/doc/backups.html

I love this line:

> "OMG, three drives is so expensive! That sounds like a hassle!"

> Shut up. I know things. You will listen to me. Do it anyway.


Warning to those who might be new to jwz's site: copy and paste that link into your browser, because if the Referer header shows you came from HN, he serves an NSFW picture instead.

You actually made me curious to see that pic, but I clicked the link and don't have that problem.

Here is the image in case anyone is interested: https://cdn.jwz.org/images/2016/hn.png I have to admit that Jamie (jwz) has done important things and contributed a lot to many others, but I have to disagree with most of what that page says about backup procedures. Hey, this is the web, and everyone has their own views on what is best and on what to display to a user redirected from HN :-)

The advice this guy gives is far from ideal:

First, when a drive fails, it often doesn't tell you right away. If you rsync --delete, there is a high chance your deletions will propagate to the backup. The same goes for corruption.
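One hedged mitigation, assuming rsync and placeholder paths: have rsync move deletions and overwrites aside into a dated directory instead of discarding them, so a silently failing source can't quietly empty the backup:

    rsync -a --delete --backup \
        --backup-dir=/mnt/backup/removed-$(date +%F) \
        /data/ /mnt/backup/current/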

Second, keeping a drive with potentially private data unencrypted is a liability, especially offsite. Someone could steal it, and all your private stuff is out. I'm not even considering the case where you store crypto on said PC.


> Second, keeping a drive with potentially private data unencrypted is a liability, especially offsite. Someone could steal it, and all your private stuff is out. I'm not even considering the case where you store crypto on said PC.

OTOH, keeping your storage encrypted means losing your keys is losing your data, and your heirs will likely be unable to access it.

You have to weigh your threats.


I think people need to get a strategy - any strategy - before they can think about this sort of second-order detail.

I have scripts to do two things:

1. Use rsync to back up everything to a Synology NAS every day

2. Back up from the NAS to tarsnap (a cloud service with cheap, encrypted backups) every few weeks.
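Roughly, a sketch of those two steps (hostnames, paths, and the tarsnap archive name are placeholders; step 2 assumes the NAS share is mounted wherever tarsnap runs):

    # 1. daily
    rsync -az --delete ~/ backup@synology:/volume1/homes/me/

    # 2. every few weeks
    tarsnap -c -f "nas-$(date +%Y-%m-%d)" /volume1/homes/me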

There are only a few important ideas and a whole lot of ways to accomplish them:

1. Backup frequency and retention policies. I forget the term for this, but you want something that prioritizes more frequent backups for more recent data (see the borg sketch after this list). For example, see the Apple Time Machine backup policy:

"hourly backups for the past 24 hours, daily backups for the past month, and weekly backups for everything older than a month until the volume runs out of space"

2. Don't micro-optimize decisions like which hard drive is most reliable. It's a waste of time; ultimately any drive can fail, the filesystem on that drive can fail, etc. Instead, use backups. RAID gives you disk-level redundancy, and an extra backup in the cloud will almost never fail, because the provider has pros making sure of that.

3. A backup you haven't tested restoring from is NOT a backup.
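On point 1: the usual names are tiered retention or grandfather-father-son rotation, and tools like borg express it directly. A sketch mirroring the Time Machine policy above (the repo path is a placeholder):

    borg prune --keep-hourly=24 --keep-daily=30 --keep-weekly=52 /mnt/backup/borg-repo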


I primarily care about photos and videos. My solution is simple and has worked well for the last 7 years.

                                           /-- Dropbox
  Synology NAS -> normalize file names ---<
                                           \-- Google Photos
  
I wrote this tool to normalize folder and file names <https://github.com/jmathai/elodie>.

I wrote about the rest of the system in the following posts.

1. https://medium.com/@jmathai/introducing-elodie-your-personal...

2. https://medium.com/@jmathai/understanding-my-need-for-an-aut...

3. https://medium.com/@jmathai/my-automated-photo-workflow-usin...

4. https://medium.com/@jmathai/one-year-of-using-an-automated-p...


Someone pointed me to these [1]: M-Disc, a 25-pack of 100GB discs for around $260, rated to last 1,000 years. Probably a marketing number, but I'd be happy even if it were 100 years. It's still on my research list; I have to look at how and why it works and whether the discs are really that good in real life. But [2] seems reassuring.

I don't think the cost is high in terms of long-term TCO, assuming they really last that long. The problems are the time it takes to burn the data, and that the discs are practically not searchable.
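For reference, a typical burn workflow on Linux might look like this (the device path and directories are placeholders; the par2 parity files should be kept with another copy, so a partially unreadable image can be repaired later):

    genisoimage -r -J -o archive.iso /data/photos-2021/
    par2 create -r10 archive.iso
    growisofs -dvd-compat -Z /dev/sr0=archive.iso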

I hope there is next-gen storage, optical or not, that brings larger capacity and a longer life cycle.

I also have a vague idea for a NAS with two drives: one for your local copy, the other for storing bits and pieces of data from other users of the same brand of NAS, for recovery purposes.

[1] https://www.amazon.com/Verbatim-98914-M-Disc-100GB-Surface/d...

[2] http://www.microscopy-uk.org.uk/mag/indexmag.html?http://www...


Sounds like you want a modern equivalent of one of the old Auspex/NetApp servers that used distributed RAID HDDs for the head of every file, backed by optical-disc robots for the rest; for large files, you get a nice cross of the performance of HDD with the cost of optical.

Also, no one here has mentioned Duplicati yet (https://duplicati.com). I switched last year, and it's flat-out wonderful.

The versioning and backups just work, it's free and open source, runs on pretty much everything, has good encryption baked in, and supports a really broad array of backend storage protocols, both cloud and local. Oh, and it comes with a pretty decent scheduler and a web-based GUI.

Quite honestly, the best piece of OSS I've adopted in the past two years. I no longer pay for any backup software (just S3/Wasabi storage), and I no longer have to do any integration or management of the moving parts - it just works. I like that, as I no longer have the patience for crappy software now that we're two decades into the 21st century.


My setup is the following:

1. Multiple laptops get rsync'd to a WD MyBookLive network drive every 4 hrs (yes, I know about the recent WD issue - see note at the end)

2. The WD MyBookLive data gets rclone'd to a Microsoft OneDrive Business Basic account every day (it used to be Amazon Cloud Drive - see note)

3. The WD MyBookLive data gets rclone'd to BackBlaze B2 every week
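A sketch of steps 2 and 3 as cron-able one-liners, assuming remotes named "onedrive" and "b2" were set up with rclone config (names and paths are placeholders):

    rclone sync /mnt/mybooklive onedrive:backup
    rclone sync /mnt/mybooklive b2:my-backup-bucket --fast-list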

Notes:

1. Yes, there was a huge security mess with the WD MyBookLive, where people had all their data deleted if they enabled UPnP and allowed the drive to punch a hole in their NAT. But never expose any such "IoT" devices directly on the WAN - always block these devices completely on the router's WAN interface and SSH-mount your network drives on a "home server." Have cron jobs on the server do the sync with cloud drives. The MyBookLive is much less expensive and, if you keep it off the WAN, can work very well as a NAS.

2(a). Amazon Cloud Drive used to be a good service, even after they got rid of their "unlimited" plans at $5/mo/1TB. But then they blocked rclone, and now there are close to zero clients that work with it - the only ones are their slow web interface and an "odrive" client that almost never syncs.

2(b). Microsoft OneDrive offers a nice "Business Basic" plan for $5/mo/1TB that seems pretty good.

3. B2 storage is $0.005/GB-month - slightly more expensive than Amazon Glacier at $0.004/GB-month, but its download cost of $0.01/GB is better than Glacier's $0.09/GB.


I have a machine with two 6TB RAID 1 arrays. I have a backup script that periodically runs rsync against a bunch of hosts onto one array. Then, weekly, I run borg backup from one array to the other. Between borg's compression and deduplication I have months of backups on the backup array.

Then at least once a month I bring in an offsite 6TB USB drive and rsync the borg backup to it after manually reviewing that the current borg backups are “sane”.
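The weekly array-to-array step might look roughly like this (paths are placeholders; zstd compression needs borg 1.1.4+, otherwise use lz4):

    borg create --stats --compression zstd \
        /mnt/arrayB/borg-repo::'{hostname}-{now:%Y-%m-%d}' /mnt/arrayA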

The main advantage to this approach is that even with complete physical destruction I can access/restore anything within hours. Additionally, going back to old versions or checkpoints locally is extremely fast (seconds to minutes) no matter the file size. I also have occasions where I create/change a lot of data in the same day, to the point where a former overnight cloud backup process would occasionally fail to run because it was still running from the night before!


Redundancy is your friend. If you are serious about data retention, you should be storing your most important data offsite. If you are serious about privacy, this may require backing up to a hard drive and physically storing that offsite yourself.

For personal data? I don't. If it's important, I make hardcopies; everything else is ephemeral and ultimately replaceable. Not worrying about backups saves me a lot of time, money, and stress.

I've got a Syncthing server, backed by S3QL, that stores the data in Backblaze.

Syncthing is not backup, more synchronization. But for simple stuff it's more than good enough, especially for pictures that I want to keep forever.

https://redbeardlab.com/2021/08/03/my-syncthing-setup-cheap-...


I use Seafile, but that's not really important, since you could use any other Dropbox-style app. Everything is stored locally on ZFS in a RAID-Z2 configuration with native ZFS encryption, so I can lose up to two of the six drives in my setup with no data loss. I take snapshots automatically and push incremental updates to a third-party server, and since the snapshots are encrypted, the provider can't read the data either.
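A sketch of that snapshot-and-push step (dataset and host names are placeholders): with native encryption, zfs send -w sends the raw encrypted records, so the receiving server never sees plaintext:

    zfs snapshot tank/data@2021-09-18
    zfs send -w -i tank/data@2021-09-11 tank/data@2021-09-18 \
        | ssh backup@offsite zfs receive backup/data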

I made a script that automates this:

1. rsync to USB disk into snapshot/ (with global and per-directory filters)

2. on USB disk, borg from snapshot/ to borg/

3. rclone borg/ to multiple object storages (e.g. B2)

"USB disk" refers to LUKS-encrypted ext4 partition on USB attached rotating hard drive. I use this for my PC and some remote servers (rsync can pull over ssh).


For now I just rclone (https://rclone.org/) a few directories to B2 or S3. I created an application key with write-only permission to the cloud storage.
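With B2 that key can be made from the b2 CLI; a sketch with placeholder names (double-check the exact syntax with b2 create-key --help - listing capabilities are still needed so rclone can compare files, but omitting readFiles keeps the key from downloading data):

    b2 create-key --bucket my-backups rclone-writeonly listBuckets,listFiles,writeFiles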

Subquestion:

Does anyone have any good tools to pull (hopefully backed up) data off of failing disks?

What about imaging disks that are in filesystems your computer cannot read?

Ideally, these would be cross-platform, but suggestions for specific OSes would also work.


dd and photorec + testdisk.

Image the failing drive, then run recovery tools on the image.
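A sketch of that workflow (the device name is a placeholder; GNU ddrescue is usually a better imager for failing media, since it retries and keeps a map of bad sectors):

    # image the dying disk, tolerating read errors
    dd if=/dev/sdX of=disk.img bs=64K conv=noerror,sync status=progress

    # then carve files / inspect partitions from the image, not the disk
    photorec disk.img
    testdisk disk.img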


I'm wondering if I should install the Windows Subsystem for Linux (WSL) to use those tools. It looks like I'm going to need to use Windows with these drives.

I've been using a Raspberry Pi 3 for a long time with an encrypted USB drive for storage. Like the other commenter, the Pi also syncs the drive to the cloud using rclone.

I encrypt stuff and just keep it in various cloud storage solutions. I wrote my own encryption program in Go using a combination of scrypt and AES-GCM.

Why don’t you use existing encryption programs?
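For comparison, an existing tool like age covers similar ground (an scrypt passphrase KDF plus an AEAD cipher); a minimal sketch with placeholder file names:

    age -p -o backup.tar.age backup.tar     # encrypt with a passphrase
    age -d -o backup.tar backup.tar.age     # decrypt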

Just Backblaze. Then I consider "got it onto my PC" == "backed up" for other devices.

Git-Annex across my devices, with archives on rsync.net and my cloud VPS

3-2-1.

Time Machine (USB hard drive), NAS (Synology) and the cloud (Backblaze).


I use Backblaze

Yes, there are some challenges in backup and recovery: for ordinary data, for installed programs, for operating system files, etc.

I'm struggling, don't have good solutions, have a good start, but otherwise have only a work in progress.

On my HP Windows 10 laptop, I have the D:\ recovery partition.

Otherwise I have 3 Western Digital USB external 3.5" form factor hard disk drives. Two of the drives have 2 TB (trillion bytes) of space, and the third one, new, has 5 TB.

I back up using ROBOCOPY with some carefully selected options. Occasionally I do a full backup of my data on C:\ and frequently I do an incremental backup of that data.

By my data I mean the Windows file system directory tree rooted at directory (for Apple users, folder)

     C:\Users\user1\
This procedure does a lot of good but has some flaws. One is that some of the directories close to C:\ are special in Windows and don't work in the normal ways with the command-line command DIR, etc. I don't understand all the problems, but for one it appears that there are circular references in the directory structure that send some software operations, e.g. part of a backup, into infinite loops -- as far as I can tell currently, a really big, gigantic bummer.
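A guess at those "circular references": they are usually NTFS junction points (e.g. the legacy "Application Data" link inside each profile), which exist for backward compatibility and do loop naive tree walks. If so, ROBOCOPY's /XJ switch skips junctions; a sketch with placeholder paths:

    robocopy C:\Users\user1 E:\Backup\user1 /MIR /XJ /R:1 /W:1 /LOG:E:\backup.log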

So, there are problems: last week an external keyboard held down a key on my laptop, and Windows got confused and deleted several icons from my desktop. Bummer.

Eventually Windows users discover (apparently secret knowledge) that each such icon is from a file of type LNK. So, some of these files are in directory

     C:\Users\user1\Desktop\
but others are in directory

     C:\Users\Public\Desktop\
This directory was hidden until I used

     attrib -H
to unhide it. Well, the LNK file for the icon of my installation of an old version of Google's web browser Chrome was in that directory, and it was one of the LNK files that the confused Windows deleted.

And the actual EXE file for Chrome was in a directory that does not play well with DIR, etc.

So, it looked like (I'll do better next time) I had to go to the directory

     C:\Users\user1\prog05\google\chrome\
with program

     chromesetup.exe
and run that. And that is what I did. So, I reinstalled Chrome. I don't know if what was reinstalled was the old version of Chrome I had or a newer version. I care: when I like an old version of a program, I'm very reluctant to install a new version that replaces it.

E.g., Mozilla Firefox: I used to like it, but the recent version I have changed the user interface (in ways I regard as silly and steps backward) and, really bad, forces me to close a popup window a few dozen times a day asking me to install a new version. [I do NOT want a new version -- Mozilla, I deeply, profoundly, bitterly hate and despise the whole idea of frequent new versions. Please, please, please STOP pestering me to get new versions.]

Further, when I have Firefox save a web page and then display the saved copy, Firefox pings me asking to make Firefox my default web browser, which it already is. And, once again, Firefox pesters me, interrupts my work, and gives me another popup demanding that I reject installing a new version of Firefox. Further, some of the web pages saved by Firefox, Firefox won't display, but Chrome and Brave will. Silly situation.

Due to such pestering, I may have to junk Firefox: I have 100,000 lines of .NET code whose comments carry tree names of documentation, 6000+ web pages, and I used to use Firefox and a single keystroke to display such a page. Maybe now the pages won't display, or my work will get interrupted by two !@#$%^&*() popup windows I HATE. Looks like I may have to junk Firefox and go for Chrome, Brave, or some such. Maybe there is a way to download and keep a version of Firefox 5-10 years old that I liked JUST FINE the way it was -- no popup hassles.

So, net, Windows getting confused by a held-down key forced me to download a new version of Chrome -- really bad bummer. And the Chrome EXE file is in a directory that does not play well with DIR, some approaches to backup, etc.

For the LNK files in directory

     C:\Users\Public\Desktop\
I've copied those to part of the file system that behaves normally and where I can back it up.

It looks like I will have to do some system management mud wrestling to get the EXE files, etc. for programs in misbehaving directories

     C:\Program Files

     C:\Program Files (x86)

     C:\Windows
to a normally behaving part of the file system where I can use ROBOCOPY to back up the files and just COPY or XCOPY to restore selected files/directories from the backup.

To me these misbehaving directories look like a grand design disaster of Windows. Windows has had 30 or so years to get such things right and still has some serious, first-grade problems. I will have to investigate ways to work around these Windows disasters.

Near the top of my TODO list is getting around bad or missing Windows documentation to get around some really bad Windows system management disasters.

E.g., those circular references in the file system directory tree -- no longer a tree. WHAT a bummer. Where is the documentation for how the heck to work with such directories, list them, copy them, restore them? Was it really necessary to ruin the file system in this way?

E.g., I can't do a routine backup and restore and, thus, am pushed into depending on Web sites continuing to provide the downloads I want -- years from now. Bummer.

And part of this work to be done is to understand what I can do with the HP D:\ recovery partition and how to do it.

Uh, in simple terms, I want to know how to boot from a DVD, restore all the HP and Windows stuff from somewhere (DVDs if the size is small enough, external hard drives otherwise), and then use ROBOCOPY to restore everything else. This goal seems simple and obvious enough. Now a big detour in my work is figuring out how to do that -- knowing that good documentation will be rarer than hen's teeth. System management mud wrestling to get around problems of HP and Windows instead of my real work.

On my other computer, a desktop I assembled from parts (e.g., an AMD FX-8350 processor), I have lots of internal hard disk space and two new 4 TB disk drives ready to install. So far I do a lot of internal backup.

One problem: it appears that Windows has seen some free space on one of my hard disks and decided to use it for a lot of temporary storage for running, installed programs. Since that storage changes frequently, it gets backed up as part of an incremental backup and makes my incremental backups much larger than they should be. Bummer. That is just what I do NOT want. Windows never asked me about using my disk volumes this way, and so far I can't find any documentation for how to tell Windows to put that temporary junk where I want it, out of the way of my real work. Windows, messing up my real work. Bummer.



