Hacker News

I admire the goal of reliable long term data storage. I don't need to store tweets or 5TB videos, but my current solution is duplicate DVDs with a bunch of par2 files to hopefully ward off bit rot.

I feel like there's some room for improvement.

Hook up SSDs in a ZFS RAIDZ pool on a low-powered machine (like a Raspberry Pi). Set up regular ZFS scrubbing, and connect that to a monitoring service.
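A minimal sketch of that scrub-plus-monitoring setup; the pool name `tank` and the healthcheck URL are made-up examples, not anything from the comment:

```shell
# /etc/cron.d/zfs-scrub -- hypothetical pool name "tank".
# Scrub monthly; the day after, ping a dead-man's-switch monitoring
# service only if the pool reports healthy, so a silent failure
# shows up as a missed ping.
0 3 1 * * root zpool scrub tank
0 3 2 * * root zpool status -x tank | grep -q 'is healthy' && curl -fsS https://hc-ping.com/your-check-uuid
```

Using a "ping on success" check (rather than "alert on failure") means you also find out when the box itself dies and stops scrubbing.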

You're not going to prevent silent bitrot no matter what modern technology you use, so take a proactive approach to detecting it before it turns into data loss.

Agreed. I used to back up onto DVDs, creating a set of (say) 12 DVDs with an extra couple generated by par2, to take care of the case where some of the DVDs just straight aren't readable.

However, I found that I had a lot of data to back up, and it was actually cheaper and less tedious to get 4TB USB hard drives for ~£100 each, and plug them into an old retired EeePC 901 (with the added advantage that if the power goes out it has a battery).

My main PC has an SSH private key that lets it access a restricted shell on the EeePC that only allows it to hand over files to store. That way, if a hacker breaks into my internet-facing machine, all they can do to my backups is fill the disk up, not delete or access anything. I have a process on the EeePC that regularly scrubs the par2 files, and the hard drives (I have two so far) are formatted with BTRFS, so since all the data is regularly read by the scrub process, that should catch any drive failures. My main PC uses ZFS, so I have safety in variety.
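One way to get that write-only effect (not necessarily this commenter's exact setup) is an SSH forced command in `authorized_keys` using `rrsync`, the restricted wrapper script that ships with rsync; the paths and key are placeholders, and the `-wo` (write-only) flag requires a newer rrsync:

```shell
# ~/.ssh/authorized_keys on the backup machine -- hypothetical paths.
# "restrict" disables port forwarding, PTY allocation, etc., and the
# forced command replaces whatever the client asked to run, so the
# main PC's key can only push files into /backups/incoming.
command="/usr/bin/rrsync -wo /backups/incoming",restrict ssh-ed25519 AAAA...mainpckey... main-pc
```

Even if the main PC is compromised, this key can write new files but can't read or delete what's already stored.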

I also have an off-site backup stored on an encrypted USB hard drive in my locked locker at work, which isn't updated as regularly. My internet connection is slow, so I use the rsync --only-write-batch trick, and then carry the large update file to the backup on my laptop.

What could possibly go wrong?

Re: 4TB drives, I do the dollars-per-GB calculation before buying hard drives. The most recent time, I included the enclosure cost and found that it was actually cheaper to go huge. Granted, the enclosure was a Synology, but buying 16TB drives is the closest I’ve been to ‘solving’ storage in a long time. Formatting them and adding them to the array was brutal, and they are noisy, but it has been worth it.

Yeah, I'm using 4TB laptop-sized (2.5") external USB drives, because they are small, cheap, quiet, and they don't need a secondary power supply. At the time of purchase, they were the sweet spot, with 5TB drives considerably more expensive. That may have changed now.

If you're going for 3.5" drives, then yes I can well believe that the sweet spot is with slightly larger drives, especially if you take enclosures into account. I did the calculations for work a while back for shoving hard drives into something from https://www.45drives.com/ and it seemed that getting the largest drives possible was the best price/capacity option.

Buying enough 16TB drives for an efficient RAID array is an expensive way to save money.

Something that's easy to overlook with larger drives is that their rebuild times are worse.

"Shucking" drives throws the economics way off even if it means having to do some hacks and losing warranty... Usually the drives that come in enclosures are smaller.

> Buying enough 16TB drives for an efficient RAID array is an expensive way to save money.

A lot of ways to be efficient with money start by having or using a lot of it:)

You can buy a 2TB drive for $100. For $200 you've got 430 DVDs' worth of data redundantly stored. For $300 you've got local redundancy AND an offsite backup.

Even if magnetically they don't have 'bit rot', they use bearings where the lubrication can dry up and wear out when they're not turning for long periods of time.

You need to keep them spinning on a regular basis, and replace them as they begin to fail.

HDDs are also prone to silent bitrot, where the drive simply returns incorrect bytes for a sector, without any SMART errors. (Optical discs also bitrot, but so do HDDs.)

This is usually a precursor to SMART errors in the near future, but unfortunately it can still result in corrupted replication and corrupted backups, as your backups will faithfully copy the rotten (corrupt) data.

I've witnessed this happen on both Seagate and WD drives, on systems with ECC memory. I can only suspect this is due to HDD manufacturers wanting to reduce their error rates and RMA rates: it may happen when the ECC bits in a sector are corrupt, making the bitrot undetectable. Instead of giving an error (and being grounds for an RMA replacement), the HDD firmware may choose to return non-integrity-checked data, which would usually be correct but could also be corrupt.

It's why filesystems like ZFS and btrfs are so important.

My rough estimate, based on my own experiences and those on r/DataHoarder, is that 1 hardware sector (4KB for most drives post-2011) will silently corrupt per 10TB-year. Such corruption can be detected by checksumming filesystems like ZFS.

Usually, the whole sector is garbage, which is not indicative of cosmic ray bitflips.
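To put that 1-sector-per-10TB-year estimate in perspective, a quick back-of-the-envelope calculation; the array size and timespan below are made-up examples:

```python
# Back-of-the-envelope check of the rate above: roughly 1 silently
# corrupted 4 KB sector per 10 TB-years of running disk.
def expected_bad_sectors(capacity_tb: float, years: float,
                         tb_years_per_sector: float = 10) -> float:
    """Expected number of silently corrupted sectors over the period."""
    return capacity_tb * years / tb_years_per_sector

# A hypothetical 40 TB array left running for 3 years:
print(expected_bad_sectors(40, 3))  # 12.0 -- about 48 KB of silent damage
```

A dozen rotten sectors over an array's life is rare enough that you'll never notice without checksums, and common enough that you shouldn't bet your archive on it never happening.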

External flash memory storage like USB sticks and SD cards fares far worse. In my own experience, silent corruption occurs more like 1 file per device per 2-3 years, irrespective of the size of the memory. I've had USB sticks and SD cards return bogus data without errors, over and over. I only know because I checksum everything; otherwise I would have thought the artefacts in my videos or photos came with the source.
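"Checksum everything" can be as simple as keeping a manifest of SHA-256 hashes alongside the archive and re-checking it on every scrub. A minimal sketch (the function names and layout are mine, not a description of any particular tool):

```python
import hashlib
from pathlib import Path

def sha256_file(path: Path) -> str:
    """Hash a file in 1 MB chunks so large videos don't need to fit in RAM."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def build_manifest(root: Path) -> dict:
    """Map each file's path (relative to root) to its checksum."""
    return {str(p.relative_to(root)): sha256_file(p)
            for p in sorted(root.rglob("*")) if p.is_file()}

def verify(root: Path, manifest: dict) -> list:
    """Return the relative paths whose contents no longer match the manifest."""
    return [rel for rel, digest in manifest.items()
            if sha256_file(root / rel) != digest]

# Usage: dump build_manifest() output to JSON next to the archive,
# then run verify() on each scrub to catch silently rotten files.
```

Any file that comes back from `verify()` is one you'd otherwise have blamed on "the source".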

If, in 2020, you are not using ZFS or btrfs for long term archival, you are doing something wrong.

ext4, NTFS, APFS, etc may be tried and tested, but they have no checksumming, and that is a problem.

Interestingly, on my home ZFS raidz with 3 4TB hard drives, I have had to replace a drive a couple of times because ZFS scrub was reporting silent corruption. They were consumer-grade SATA drives.

However, at work, I have backed up ~200TB of data to a large server with RAID-6 and ext4, storing the backups as large .tar files with par2 checksums and recovery data, and regularly scrubbing the par2 data. I have yet to see any corruption whatsoever. These are enterprise-grade hard drives. This is the strongest evidence I have yet seen that the enterprise-grade drives are actually better than the consumer-grade ones, rather than just being re-badged.

Enterprise drives have different firmware, especially from an ECC and integrity perspective. From a price/perf standpoint though, shucking consumer-grade drives with ECC wins.

Thanks. What are the drives at your workplace?

I actually have no idea. I didn't have any part in purchasing that particular system, I don't have root, and all the drives are hidden behind a RAID controller. Sorry.

How do you know they are enterprise drives then?

I have a home "NAS" (an openSUSE server) where my main /data partition is xfs, but it mounts a btrfs backup partition, rsyncs to it, and takes a snapshot.

I should really get around to converting the main drive to btrfs, but this works well.

Does proper use of ZFS also require ECC memory?

ZFS protects you from disk errors. ECC protects you from memory errors. Using one or the other is safer than using neither. Using both together is even safer.

100% yes. With non-ECC you will always have bad RAM bits eventually. With ZFS this is especially bad because it can corrupt your checksums or your ZFS metadata, which means either silently corrupting your data, or corrupting ZFS itself and losing your entire zpool (akin to losing a RAID array).

Maybe not: that ZFS needs ECC is "common wisdom", but the disaster scenario appears not so likely. See



This is FUD perpetuated by a certain individual on the FreeNAS forum.

Ideally you would stagger the purchase of the drives over time, and make sure to hook them up and check them every so often (once a year?).

Much like other commenters I'm no expert on the topic, but I think you'd have to be pretty incredibly unlucky to have a mechanical failure on 3 drives at once from lack of use, especially if they were from different manufacturing batches.

I'm no expert on the issue so correct me if I'm wrong, but I've heard modern HDDs use fluid bearings and aren't susceptible to drying up.

My setup (laptop, Win 8.1 Pro): external 4TB disk, assigned to letter L (for Library). I got Acronis running once per month, dumping a 70-80GB .tib file on L. L also has a backup folder with everything I've got (setup files, books, photos, every audiobook/video I need such as trainings, etc.). The whole backup is ~2TB.

Now get Carbonite (not affiliated, I just like the infinite space backup), and get it to back up your key laptop folders (Docs, Images, Desktop, etc.) and your L-drive. I don't remember how much it costs ($6-10/mo?), but I have stopped worrying since then. I've got a monthly .tib file for my system and an "instant" backup for everything else. So even if my laptop is stolen I can set up a new laptop (the .tib may be useless, but I can open it to see what s/w I had, and take the config files/folders over to the new system).

I don't remember how much the disk was but it didn't hurt my wallet, and the ~$100 (?) per year on Carbonite (had CrashPlan) definitely doesn't hurt my wallet.

If you continue your DVD method you may also want to look into "dvdisaster" https://en.wikipedia.org/wiki/Dvdisaster

For anyone who uses par2 to protect a file tree: I wrote a small utility to help you maintain that tree (e.g. bulk verify/create/status) [1].

[1] https://github.com/brenthuisman/par2deep

Do you store your par2 files on different discs from the data they refer to? Do you have multiple copies? Do you store the discs in a cool, dark place when you're not using them? Do you have another form of backup, preferably offsite?

If you do all these things, I think that's about the best you can do with optical media.

Haha, I keep the par2s on the same disc, and then I make multiple copies of the same disc and they're in a cardboard box in my basement somewhere.

The dye on writable optical discs fades over time... So if all the copies were written at the same time, they might all stop working at around the same time and cause problems :)

Yeah if you're really taking this seriously it's worth investing in an LTO-4 drive and some tapes.

There might be a better medium available nowadays but if I seriously wanted to have a piece of data fifty years from now that's where I'd start.

Really, any second form of backup would be a good idea. Preferably, it should be stored offsite and separate from any other backups.

You can always try an affordable online service like Backblaze B2 as one of your options. I haven't checked lately but I think they cost around $5 per terabyte per month. Plus usage, but that would be minimal for an archive.

Your credit card may expire or get blocked while you're in hospital, or billing-alert emails may simply go to your spam folder for 45 days, and your data will be irreversibly deleted by Backblaze.

There are LOTS of failure cases with any cloud provider, especially one with a crazy policy of deleting data in just 45 days.

There is at least 1 reddit post a month about how someone lost data with Backblaze. Their reddit support rep is never able to do anything about it, other than "sorry, we will take on board your feedback".

For comparison, if your Google Drive subscription lapses, Google stops you from uploading but will not delete your data.

This. I was backing up an external drive over a very, very slow connection. I plugged it in to the machine late in the month, but forgot to turn the power on at the wall. The upload didn’t start, and I lost the full backup and had to re-upload the whole drive.

A good lesson was learned but it hurt. The upload took weeks to complete.

Most BD-R discs are physically higher quality than DVD-Rs. They can be comparable to archival-quality M-Disc media.

I suggest trying Syncthing. It is a cross-platform P2P file synchronization tool, super easy to set up.
