My issue with tape is that unless your operation is big enough to justify a robotic tape library as a budget line item, support contracts and all, you're down to paying someone (or eating the cost yourself) to physically swap tapes. That's much more expensive and boring than deploying a ZFS server with gobs of disk that, once configured, just sits in the rack and does its job quietly.
I think you’re dead on. At smaller scales, it’s cheaper just to keep more disks running. But once you get to scale, tape is great for archival work.
The trick I’ve always found is figuring out where you sit on that inflection point. And it’s hard. Is 1 PB enough to justify tape? (Which seems like a crazy question to me - I remember having megabyte-sized tapes.)
When I did the calculations a few years ago, break-even was somewhere around 150 TB, and I don’t think it’s changed too much since then. This is just considering the cost of drives and media. Obviously, the real inflection point is going to be different depending on all sorts of factors that may be specific to your situation and your priorities.
Usability isn’t something you can ignore, but it’s not like hard drives are perfect either—do you buy a big NAS / SAN setup and plug drives into it? Will it get full? And tape has the advantage that it’s completely immune, out of the box, to ransomware.
I think there are four cases that really scream for tape.
1. Data hoarders, who just want to store as much data as possible for the cost. There’s an r/DataHoarder subreddit if you’re curious about these people.
2. Archivists, who want to store lots of data long-term. Tape is a lot easier to store. I recommend that archivists standardize on a specific generation of tape for as long as possible and don’t mix generations (don’t mix LTO-4 and LTO-5, for example, even though the drives are advertised as working with both).
3. Companies with recordkeeping requirements, like SOX (Sarbanes-Oxley). Tape is just really good for that. It has a way of surviving problems in your IT department.
4. Companies with enough data that they can put a line-item on the budget for backups, and justify the operational cost of tape—support contracts, keeping staff on hand who know how to use tape, that kind of thing.
Of these, the data hoarders are going to use the 150 TB break-even point just because they want as much data as possible. Everybody else is going to make decisions based on other factors, like staffing or compliance. There are a number of gotchas, like problems with mixing tape generations, the prospect of using robotic tape libraries, and support contracts, that make the tradeoff much more nuanced.
Yes: weighing the additional cost of a tape drive against the price difference per stored TB between tape and HDDs gives an intersection point somewhere between 100 TB and 300 TB.
Taking into account that long-term archiving requires storing 2 or 3 copies of the data proportionally reduces the threshold above which tape is preferable.
The storage cost per TB has to account not only for the purchase price but also for the lifetime of the media; an HDD model with a 2-year warranty, for example, cannot be trusted to store data for much longer than that.
Tape is rated for 30 years, but the practical storage time is usually closer to 10 years, because the corresponding tape drives risk becoming obsolete.
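To make that arithmetic concrete, here's a minimal Python sketch. Every price in it is a placeholder assumption, not a quote; plug in current figures for your own situation:

    # Back-of-the-envelope tape-vs-HDD break-even. All prices are
    # placeholder assumptions, not current market quotes.
    TAPE_DRIVE_COST = 4000.0  # one-time cost of a tape drive (assumed)
    TAPE_PER_TB = 5.0         # tape media, $ per stored TB (assumed)
    HDD_PER_TB = 20.0         # hard drives, $ per stored TB (assumed)

    def break_even_tb(copies: int) -> float:
        """Capacity x where TAPE_DRIVE_COST + copies*TAPE_PER_TB*x
        equals copies*HDD_PER_TB*x."""
        return TAPE_DRIVE_COST / (copies * (HDD_PER_TB - TAPE_PER_TB))

    for copies in (1, 2, 3):
        print(f"{copies} copies: break-even at {break_even_tb(copies):.0f} TB")
    # 1 copies: break-even at 267 TB
    # 2 copies: break-even at 133 TB
    # 3 copies: break-even at 89 TB

With these made-up numbers the crossover lands in the 100-300 TB range for a single copy, and keeping 2 or 3 copies divides it proportionally, as described above.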
Yeah. It gets even more complicated, because tape drives and tape media have different failure rates.
At the places where I used tape, we used a more efficient encoding for archival tape backups. Rather than storing 2 or 3 copies, we used forward error correction with something like 30% overhead. This, then, gets even more complicated to evaluate, because it multiplies how “hungry” your backup / archive system is for ingesting data in order to remain efficient & still write data out to tape by whatever deadlines you have set. If you store 3 copies of data on LTO-8, you write data in 12 TB blocks, with one copy on each of three tapes. If you use forward error correction, you might do something like write out 96 TB at once on 11 tapes. You use less than half as many tapes in the long run, but you need to feed the machine faster in order to meet deadlines.
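A rough sketch of that comparison, assuming an 8-data + 3-parity split to match the 96 TB / 11 tape figure above (the actual parameters may well have been different):

    # Tapes per stored TB: 3-copy replication vs. erasure coding on
    # LTO-8 (12 TB native per tape). The 8 data + 3 parity split is an
    # assumption chosen to match the "96 TB on 11 tapes" figure.
    TAPE_TB = 12

    def tapes_per_tb_replication(copies: int) -> float:
        return copies / TAPE_TB

    def tapes_per_tb_erasure(k: int, m: int) -> float:
        # each stripe: k data tapes + m parity tapes for k*TAPE_TB of data
        return (k + m) / (k * TAPE_TB)

    print(tapes_per_tb_replication(3))  # 0.25 tapes per stored TB
    print(tapes_per_tb_erasure(8, 3))   # ~0.115 tapes per stored TB
    # Less than half the tapes, but each stripe is 8 * 12 = 96 TB, so
    # the system must ingest data much faster to fill stripes on time.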
While it is true that more complex error-correction methods can greatly reduce the number of tapes needed for storage, the simple approach of making 2 or 3 copies remains necessary in many cases, because storing the copies in different geographical locations is typically the only method that can protect an archive against a calamity that destroys an entire storage center.
It is possible to use RAID-5/RAID-6-like encoding schemes that can survive the complete destruction of 1 or even 2 storage centers while using fewer tapes than simple copies, but such schemes are practical only for very large organizations that operate more than 3 separate storage centers.
Yes, the needs for archives (sole copy) and backups (additional copies of live data) are different.
> It is possible to use RAID-5/RAID-6-like encoding schemes that can survive the complete destruction of 1 or even 2 storage centers while using fewer tapes than simple copies, but such schemes are practical only for very large organizations that operate more than 3 separate storage centers.
Scenarios where two storage centers are destroyed—that’s extreme. The most paranoid scenarios I’m normally willing to entertain are along the lines of one data center burns to the ground in a generator fire, and somebody drives a truck full of backup tapes into a ditch and they’re all covered with mud and sand.
Tapes have a high enough failure rate that you benefit from forward error correction and you benefit from planning to handle individual tape failures. This includes stuff like the tape leader breaking, somebody losing a tape, damage during transport, water damage in storage, etc.
There’s a layered approach here, where you plan for different disasters at different levels of the stack. Each layer exposes some certain failure rate to the layer above it, and deals with some certain failure rate at the layer below it. When I think of backups, I often imagine a top-level data storage system that has a geographically distributed hot backup, and then an offline cold backup. This lets you survive complete destruction of one data center, or lets you survive a catastrophic software bug that destroys data (and a bunch of tapes are damaged on top of that). Pretty good baseline, IMO.
Another big case besides SOX compliance is medical records generally or imagery specifically.
Basically anywhere where you have a lot of data that has to be retained indefinitely for regulatory compliance or practical reasons is a great case for tape. But yeah, the robotic library and the service costs are pretty high until you hit a huge amount of data.
Source: supported a couple of large medical installations for a couple years many years back. Can confirm that you're dealing with a lot of mechanical complexity per GB until you get to an absolutely enormous amount of data. I genuinely can't imagine breaking even with a robotic library you can't climb into.
But presuming this isn't data with low-latency access requirements (since tape is pretty useless for that, so we wouldn't be making the comparison), what's the inflection point where it becomes worth the CapEx to justify even having your own "nearline" + archival storage cluster at all, vs. just using Somebody Else's Computer, i.e. an object-storage or backup service provider?
To me, 1PB is also where I'd draw that line. Which I would interpret as it never really being worth it to go to local drives for these storage modalities: you start on cloud storage, then move to local tapes once you're big enough.
(Heck, AFAIK the origin storage for Netflix is still S3. Possibly not because it's the lowest-OpEx option, though, but rather because their video rendering pipeline is itself on AWS, so that's just where the data naturally ends up at the end of that pipeline — and it'd cost more to ship it all elsewhere than to just serve it from where it is. They do have their self-hosted CDN cache nodes to reduce those serving costs, though.)
On the other hand, with either tape or hard drives, you can leave it on a shelf for 10 years and the data has a decent chance of still being intact. Proper procedure would dictate more frequent maintenance, but if for whatever reason it gets neglected, there's graceful degradation. With AWS, if you don't pay your bills for a few months, your data goes poof. Other companies might have more friendly policies, but they also might go out of business in that span of time.
I think someone else mentioned in this very comments section that hard drives "rot" while spun down — not the platters, but the grease in their spindle motor bearings or something gets un-good, so that when you go to use them again, they die the first time you plug them in. So you don't want to use offlined HDDs for archival storage.
(Offlined SSDs would probably be fine, if those ever became competitively affordable per GB. And disk packs (https://en.wikipedia.org/wiki/Disk_pack) would work, too, given that they're just the [stable] platters, not the [unstable] mechanism; they'd work, that is, if anyone still made them, and if you could still get [or repair] a mechanism to feed them into come restore time. For archival purposes, these were basically outmoded by LTO tape, as for those, "the mechanism" is at least standardized and you can likely find a working one to read your old tape decades later.)
Even LTO tape is kind of scary to "leave on a shelf" for decades, though, if that shelf isn't itself in some kind of lead-lined bunker, given that stray EM can gradually demagnetize it. If you're keeping your tapes in an attic — or in a basement in an area with radon — then you'd better have encoded the files on there as a parity set!
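To illustrate the parity-set idea, here's a toy sketch; real tools like par2 use Reed-Solomon codes and tolerate many lost blocks, whereas a single XOR parity block only survives one:

    # Toy parity set: one XOR parity block rebuilds any single lost
    # data block. Real recovery sets (par2 etc.) tolerate many losses.
    def xor_blocks(blocks):
        out = bytearray(len(blocks[0]))
        for block in blocks:
            for i, byte in enumerate(block):
                out[i] ^= byte
        return bytes(out)

    data = [b"spam", b"eggs", b"ham!"]  # equal-sized blocks
    parity = xor_blocks(data)

    # Lose block 1, rebuild it from the survivors plus the parity.
    recovered = xor_blocks([data[0], data[2], parity])
    assert recovered == data[1]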
I think, right now, the long-term archival medium of choice is optical, e.g. https://www.verbatim.com/subcat/optical-media/m-disc/. All you really need to guarantee that it'll survive 50 years is a cool, dry warehouse that won't ever get flooded or burnt down or bombed [or go out of business!] — something like https://www.deepstore.com/.
But if you're dealing with personal data rather than giant gobs of commercial data, and you really want your photo album to survive the next 50 years, then honestly the only cost-efficient archival strategy right now is to keep it onlined, e.g. on a NAS running in RAID5. That way, as disks in the system inevitably begin to die or suffer readback checksum failures, monitoring in the system can alert you of that, and you can reactively replace the "rotting" parts of the physical substrate, while the digital data itself remains intact. (Companies with LTO tape libraries do the same by having a couple redundant copies of each tape; having their system periodically online tapes to checksum them; and if any fail, they erase and overwrite the bad-checksum tape from a good-checksum copy of the same data — as the tape itself hasn't gone bad, just the data on it has.)
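The scrub-and-repair loop those systems run amounts to something like this sketch (the paths and store layout here are hypothetical):

    # Sketch of scrub-and-repair: re-read every copy, verify against a
    # known-good checksum, rewrite failing copies from one that passes.
    import hashlib
    from pathlib import Path

    def sha256(path: Path) -> str:
        h = hashlib.sha256()
        with path.open("rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    def scrub(copies: list[Path], expected: str) -> None:
        good = [p for p in copies if sha256(p) == expected]
        if not good:
            raise RuntimeError("all copies bad; restore from cold backup")
        for p in copies:
            if p not in good:
                # the medium is (usually) fine; just the data went bad
                p.write_bytes(good[0].read_bytes())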
Paying an object-storage or backup service provider is just paying someone else to do that same active bitrot-preventative maintenance for you that you'd otherwise be doing yourself. (And they have the scale to take advantage of shifting canonical-fallback storage to an optical-disk-in-a-cave-somewhere format — which reduces their long-term "coldline" storage costs.)
Instead, you're just left with the need to do the much rarer "active maintenance" of moving between object-storage providers as they "bit-rot" — i.e. go out of business. As there are programs that auto-sync between cloud storage providers, this is IMHO a lot less work. Especially if you're redundantly archiving to multiple services to begin with; then there's no rush to get things copied over when any one service announces its shutdown.
That’s a really good point. For us (near that 150-300 TB inflection point for archival storage), it made more sense to put the data in S3 Glacier. First off, the data is originally transferred through S3 anyway, but mainly, Glacier hits the same archival requirements as tape at a pretty compelling cost.
Yeah, but that ZFS server is likely in the same rack that's going to get knocked out when the data center gets sucked into a tornado, or whatever. Whether it makes sense to have someone on staff swapping tapes, rotating backups, shipping them offsite, and keeping records of what exactly is where, not to mention handling AND TESTING restores, really depends on what your data's worth. For most places, it's cheap insurance.
More realistically, you are deploying two of those servers; redundancy is not backup. And now you have two servers to administer.
The whole point of something like tape is having an off-line copy of your data, ideally in a separate physical location. A second server with a bunch of disk in the same location connected to the same network isn't that, and will not save you from ransomware or natural disaster.
If you use even crude tools such as clusterssh, managing a bunch of machines isn’t linearly harder than managing one.
While tape lends itself naturally to being offline storage (it’s offline as soon as you eject it, after all), you can get the same property from remote servers that pull data to back it up instead of receiving pushed data. If no server can push data to any other server, only pull from them, a ransomware attack becomes a lot harder.
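A minimal sketch of that pull model, with made-up hostnames and paths; the backup host initiates everything, and production machines hold no credentials for it:

    # Pull-model backup sketch: the backup host connects out and pulls;
    # production machines cannot reach the backup host at all.
    import subprocess
    from datetime import date

    SOURCE = "backup-reader@prod-host:/srv/data/"  # read-only account (assumed)
    DEST_ROOT = "/backups/prod-host"
    snapshot = f"{DEST_ROOT}/{date.today():%Y-%m-%d}"

    # Hardlink unchanged files against the previous snapshot, so older
    # snapshots survive even if the source is already encrypted.
    subprocess.run(
        ["rsync", "-a", "--link-dest", f"{DEST_ROOT}/latest",
         SOURCE, snapshot],
        check=True,
    )
    subprocess.run(["ln", "-sfn", snapshot, f"{DEST_ROOT}/latest"], check=True)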
Also, while tape in a warehouse (or on the desk) is offline, tapes in the robot are no more difficult to destroy than hard disks. They are just slower.
It's also the software to run the blasted thing. As soon as you get into tape, shit gets enterprise-y real quick. There are open-source tools to manage tape collections, but it's not fun.
LTO tape libraries are fairly cheap to pick up second hand; it's the cost of getting the newer drives that hurts.
There are tape libraries as small as 3U with 25 slots. That’s a capacity of only 300 TB with LTO-8. It’s not hard to justify if you’re working with stuff like video.