My understanding is that the only reliable way of long-term digital archival storage is to refresh the media you are storing things on every few years, copying the previous archives to the fresh storage.
Since storage constantly gets cheaper, 100GB first stored in 2001 can be stored on updated media for a fraction of that original cost in 2024.
Long term archival is successive short/middle term archival.
I think I read this quote on Tim Bray's blog[0], but I am not sure anymore. This is now my approach, my short/middle term archival is designed to be easily transferred to the next short/middle term store on a regular basis. I started with 500GB drives, now I am at 14TB.
My first hard drive was 5MB, and I had to write my own driver for it (PDP-11, c. 1982). It was a hell of a step up from 8" floppies, enough so I partitioned it into 8 separate areas.
Even the floppies were a step up from paper tape - the older guys used to have a cupboard of paper tapes on coathangers, and linked their code by feeding the tapes through the reader in the right order.
Pretty much. You see hobbyists getting data off of 30+ year old hard drives for the novelty of it, but I can’t imagine relying on that as a preservation copy. Optical media rots, magnetic media rots and loses magnetic charge, bearings seize, flash storage loses charge, etc. Entropy wins, sometimes much faster than you’d expect.
Sometimes they fail for other reasons as well, such as improper storage.
Back in the 90's to 00's a friend had a collection of cd's that he'd written, but he stored them in a big sleeved folder container. The container itself caused them to warp slightly, which made them unusable.
I took a few for testing and managed to unbend them after some time, which turned them back into a working state.
[Note: That's the most apostrophes I've ever used in a sentence, it feels dirty]
Yeah I didn't want to use heat in case that did more damage, so I just used some weights I had lying around.
If I remember right I just stacked a few starting on the floor with a protective layer in-between so as to not scratch them (a piece of paper is fine). Then add a 5kg or whatever weight on top. After a few days I turned them over and did the same again.
After that most were flat, only one or two needed some more individual time. I imagine that if they're still not flat after that, then maybe heating them slightly in the oven or even just the sun outside might do the trick.
I have been working in long-term storage for many years. I never understood why we can't just 3D-print binary code onto thin clay tablets and then fire them for long-term storage. Clay tablets remain readable for thousands of years.
Stone - and more usefully, clay - last almost forever, but they're impractical for digital storage, since there's no useful way of imprinting on them that isn't very low density, unlike paper.
Unless we could improvise something with old-school dot matrix impact heads to print on clay - I wonder if anyone has tried.
1 line of 80col text per card is pretty awful though, and then you bring back all the horrors of 60s card sequencing but for bigger files.
Some kind of 'barcode' encoding, with heavy error correction, would probably be better. I've seen attempts that claim 500kB per side using a largely unmodified QR code system, but I suspect better could be achieved - the method of scanning is probably going to be the bottleneck anyway.
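For the curious, here's a rough sketch of that kind of chunked, error-corrected "barcode" approach using the common qrencode and zbarimg CLI tools (assumed installed); the file names, chunk size and error-correction level are just illustrative, not a tested recipe:

    # Hypothetical sketch: split the archive into chunks small enough for one QR
    # code each, base64 them, and render each with the highest EC level (-l H).
    split -b 512 -d -a 4 archive.tar.gz chunk_
    for f in chunk_????; do
        base64 "$f" | qrencode -l H -o "qr_$f.png"
    done

    # Recovery: scan/photograph the printed codes, decode, and reassemble.
    for p in qr_chunk_????.png; do
        zbarimg --raw -q "$p" | base64 -d > "decoded_${p%.png}"
    done
    cat decoded_qr_chunk_???? > recovered.tar.gz

Whether the medium is paper, film or clay, the bottleneck is the same: how reliably you can image the marks back in.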
Just one article discussing it. Do you have a source to back this up? M-DISCs are getting hard to purchase these days, and I have a lot of stuff I want to put on them which I likely will want to look at in 30 years.
M-DISC uses a special, very hardened data layer that sets it apart from other discs. That is why it works so well for long-term storage.
"Instead, the M-DISC™’s data layer is composed of rock-like materials known to last for centuries. The M-DISC READY™ Drive etches the M-DISC™’s rock-like layer creating a permanent physical data record that is immune to data rot. The stability and longevity of the M-Disc DVD has been proven in rigorous tests conducted according to the ISO/IEC 10995 test standard for determining data lifetime of optical media."
Interestingly, this is more or less how long-term cold tape storage works (tapes have somewhat different failure characteristics, so it's more like a "check read" at least every "some_time", and on checksum errors you rewrite to a new tape, restoring from "raid" duplicates, but conceptually it's the same idea)
I don't think I'd want to trust tape for more than a decade or two though, as we've seen with audio tapes, iron separation from the substrate becomes an issue a lot sooner than we'd like.
Tape also has a problem shared with hard disks - to achieve high density we've rapidly hit a stage where the technology is too complex to enable data archeology at some point in the future; using 90s era complexity hard drives is about where the archeological limit is. LTO-1 may even be beyond that complexity compared to DLT, QIC or even Data8 (helical scan may be too much of a spanner in the works)
Modern polymers may make microfiche/microfilm a longer term solution than it has been in the past with acetate film/slides, but I'm not sure how much research has been done into which polymers might be best.
For longer than a century, our best experience is, thus far, with clay and paper (assuming good quality acid-free paper, rather than cheap modern consumer paper).
As someone who runs an archive for a 90s radio show, I have to partly contradict this. I regularly get tapes that are 30-35 years old. The quality of those audio tapes is shockingly good. The old chrome audio tapes very rarely show signs of degradation compared with audio CDs from the same time period.
In the mid to late 90s we also got metal audio tapes that are even better. I have a few DAT tapes from the early 90s that are in the same league as the metal audio tapes.
Microfiche development for its own sake has mostly disappeared.
But we have a lot of great new substrates used in displays. A well produced acetate sheet may still be the best for the range of properties including aging, ink adhesion, etc. I no longer remember if microfiche is made by printing or by photo-development.
I have multiple 5tb external disks attached to my main tower that (among other things) serves up Plex content. I switch each one out every year, for equivalent of about a hundred dollars each. I try and find a compromise between the amount of read requests and availability for these disks, but in the end, if they're read often enough, they die soon enough.
What killed the last one was an experiment with installing Emby. Like many similar systems, it bewilderingly has no rate-limiting function, and will thrash a disk to within an inch of its life in order to index it. That's what did in the most recent of my external Plex drives, with multiple series and movies on it.
So yes, just keep refreshing the media, at reasonable intervals.
PS Yes, I know this is a poor method of content storage. NAS is looming up for me one of these days.
If it doesn't have to be offline for long durations, software RAID, adding a new drive every once in a while, and discarding failing drives is pretty foolproof.
AFAIK large data centers automate something like this.
The issue with (software) RAID is that you have no idea whether what you're copying is already corrupted. If the filesystem isn't checksummed there's no guarantee.
Related, CD-Rs. When I left my submarine in 2013, they (by which I mean the entire Virginia class) were still using them to store archived logs, despite my explanation that they’d be lucky to get a decade out of them. The first chosen storage location was literally the hottest part of the engine room, right in between the main engines. Easily 120+ F at all times. After protest, we moved ours to a somewhat cooler location. Still hot, and still with atmospheric oil and other fun chemicals floating around.
I look forward to the first time logs from a few decades ago are required, and the media is absolutely dead.
EDIT: they weren’t even Azo dye, they were phthalocyanine. A decade was probably generous.
I was curious how some of the more wealthy yacht owners solved the marine puzzle. What kind of computer would they use? What kind of parts would go in? What would a basic system cost? So I asked one, and he opened up a compartment with a stack of cheap Acer laptops vacuum-sealed in bags. They last 2 to 6 months; when they stop working he throws them away. The sealed one has everything installed and a full battery, and will sync as soon as internet becomes available. When plugged into something, the new laptop is never the problem. He spent a small fortune arriving at this solution.
Looked at doing this on a moderate-sized yacht, before decent chartplotters were reasonably priced. We wanted a decent-spec PC to drive a chartplotter at the helm, the nav table, and the TV in the living area. Building a PC that would survive, while still being performant, was near impossible. We were looking at fanless sealed industrial machines.
Everything. Someone could bring a lawsuit up years later, and logs would be necessary to determine if they had standing.
The aforementioned optical media storage was specific to the nuclear reactor and electric plant; I think everyone else’s were stored differently. Not positive.
EDIT: sibling comment below mentions performance data. Yes, that too. I graphed (nuclear) fuel consumption on one underway, and was surprised to find it didn’t match expected. My Captain was also surprised, and thrilled, because it meant he got to be more important (fair enough; who doesn’t want to be listened to by their boss?)
> Knowing nothing of submarines or seafaring, I'm genuinely curious as to what is logged on a ship that may be necessary a decade later?
I noted in another comment that the National Archives say only "deck logs" are retained permanently, and it looks like this site lists what they contain: https://www.history.navy.mil/content/history/nhhc/research/a..., which includes all kinds of things.
Stuff like "Actions [combat]", "Appearances of Sea/Atmosphere/Unusual Objects", "Incidents at Sea", "Movement Orders", "Ship's Behavior [under different weather/sea conditions]", "Sightings [other ships; landfall; dangers to navigation]" seem like they'd be useful for history and other kinds of research.
Stuff like "Arrests/Suspensions", "Courts-Martial/Captain's Masts", "Deaths" seem like the kind of legal records that are typically kept permanently.
Stuff like "Soundings [depth of water]" were probably historically useful for map-making.
Good question. As far as I'm aware, outside of a few special circumstances (like birth records), the vast majority of legal record preservation requirements are seven years with some being as long as ten years. Of course, with the service life of some military ships and aircraft now being stretched >30-40 years, I can imagine it might be useful to have records of component failures and replacements, especially for statistical modeling.
In the case of a ship (or sub), I'd assume that they'd rotate optical media archives off the vessel every year or two and transfer them to some central database. After all, a vessel can be lost and the data is also useful in the aggregate.
> outside of a few special circumstances (like birth records)
Also IIRC several US states require medical record retention for minors of up to one year past whenever they become adults, so that's a potential 19 years there.
No idea what they actually log but it seems like performance data from the propulsion system under a wide range of conditions would be useful when designing the next generation of such systems.
1. Incomplete copies with missing dependencies.
2. Old software and their file formats with a poor virtualization story.
3. Poor cataloging.
4. Obsolete physical interfaces, file systems, etc.
5. Long-term cold storage on media neither proven nor marketed for the task.
Managing archives is just a cost center until it isn't, and it's hard to predict what will have value. The worst part of this is that TFA discusses mostly music industry materials. Outside parties and the public would have a huge interest in preserving all this, but of course it's impossible. All private, proprietary, copyrighted, and likely doomed to be lost one way or another.
Related documentary that comes to mind: Digital Amnesia (2014) [1]
It broke my heart seeing those librarians in disbelief when their national library was sold off to the highest bidder. When they said "It seems our country does not value our own culture anymore".
Books lasted hundreds of years. Good luck trying to read a floppy from the 90s, or even DVDs that are already beyond their lifetime and are a very recent medium.
It gets worse when you read the fine print of the SSD specifications, wherein they state that an SSD may lose all its data after 2 weeks without power, and data retention rates are at less than 99%, meaning they will degrade after the first year of use. And don't get me started on SMR HDDs, I lost enough drives already :D
Humanity has a backup problem. We surely live in Orwellian times because of it.
The way I remember it, if you tried to read a floppy from the very early 90s, or from the 80s, you'd probably have no trouble at all, even many years later. You can probably still read floppies from the 80s without issue.
However, if you tried to read a floppy from the late 90s or 2000s, even when the floppy was new, good luck! The quality of floppy disks and drives took a steep nosedive sometime in the 90s, so even brand-new ones failed.
This. I have a few hundred 80s floppies (especially the less popular 3-inch CF2 format), and some from the 90s. They read well, at least as long as you don't leave them in the drive when idle (the magnetic head may affect them!). But the last decades of floppies were of horrible quality. I remember them failing after a month.
The vast majority of my 5 1/4" floppies (including "HD" 1.2MB ones) read just fine still. The vast majority (just about 100%) of my old 3.5" HD (1.44MB) floppies are unreadable. The 3.5" 720kB ones are mostly ok. Stored under the same conditions.
Also good luck trying to find a floppy drive. Yes, I'm sure you can buy one now but five or ten years from now? I'd say manufacture of the drives isn't exactly a booming business.
Younger me thought I was smart repurposing my SSDs as shockproof 2.5" external backup drives. Suffice it to say that I was a year abroad, and came back to losing all my data because of it. I was able to recover some parts, but most of it was gone.
Only buying CMR surveillance-class HDDs now for my backups. They're limited to 8TB for a 3.5" sized HDD, but that's far better as a compromise than the nightmare of losing all digital copies of tax documents that you have to keep - mandated by law - for at least 10 years.
When that happened I had to renew my ID, and had to find the original birth certificate in paper form in the hospital's paper archive to get a signed copy, and had to go there with multiple relatives to prove my identity. Just to get my ID renewed. That incident surely made me realize how important backups are.
> Only buying CMR surveillance-class HDDs now for my backups.
I have no way to confirm if this is true, however someone years ago told me that surveillance drives have inferior error correction and are optimized to not block recording in case of errors because they work on the assumption that you are storing videos, therefore if you miss a frame that's not a big problem if the rest of the file is fine. That of course would be a big issue in case of other data, especially if compressed. Again, no proof of that, but it's enough for me to avoid them for archival until I can be sure the above turns out as untrue.
On the flip side I built a PC back in 2016 (with two SSD's) which went into storage about 6 months later. I just got that PC back this year and it booted up just fine with all the data intact.
LTO-1 started in 2000 and the current LTO-9 spec is from 2021. But it only has backwards compatibility for 1 to 2 generations. You can't read an LTO-6 tape in an LTO-9 drive.
> Sticky-shed syndrome is a condition created by the deterioration of the binders in a magnetic tape, which hold the ferric oxide magnetizable coating to its plastic carrier, or which hold the thinner back-coating on the outside of the tape.[1] This deterioration renders the tape unusable.
Stiction Reversal Treatment for Magnetic Tape Media
> Stiction can, in many cases, be reversed to a sufficient degree, allowing data to be recovered from previously unreadable tapes. This stiction reversal method involves heating tapes over a period of 24 or more hours at specific temperatures (depending on the brand of tape involved). This process hardens the binder and will provide a window of opportunity during which data recovery can be performed. The process is by no means a permanent cure nor is it effective on all brands of tape. Certain brands of tape (eg. Memorex Green- see picture below) respond very well to this treatment. Others such as Mira 1000 appear to be largely unaffected by it.
Data migration and periodic verification is the answer but it requires more money to hire people to actually do it.
I've got files from 1992 but I didn't just leave them on a 3.5" floppy disk. They have migrated from floppy disk -> hard drive -> PD phase change optical disk -> CD-R -> DVD-R -> back to hard drive
I verify all checksums twice a year and have 2 independent backups.
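Concretely, that routine can be as simple as the following sketch (paths are placeholders; any hashing tool works):

    # Once, when the data lands on the current medium: record a hash for every file.
    find /archive -type f -print0 | xargs -0 sha256sum > ~/manifests/archive-2024.sha256

    # Twice a year: re-hash everything and report only the files that fail.
    sha256sum --check --quiet ~/manifests/archive-2024.sha256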
I have restored a few hundred LTO-1 and LTO-2 tapes using an LTO-3 drive a few years ago. If you keep the drives around and run Linux (which supports obsolete hardware better), keeping LTO tapes 10, 15, 20 years is not a problem at all.
A few weeks ago I wrote a restore utility for a customer, for LTO-4/5/6 tapes made with a now-defunct archival system from a defunct software company. Most of these tapes are up to 16 years old, have been kept in ordinary office cupboards, and work perfectly fine.
But you're right that archival isn't much about the media, but is a process. "Archive and forget" isn't the way.
While tape does not last forever, the LTO tapes are specified for at least 30 years.
The more serious problem is as you say that the older drives become obsolete. Even so, if you start using an up-to-date LTO format you can expect that suitable new tape drives will be available for buying at least 10 years in the future.
For HDDs, the most that you can hope is a lifetime of 5 years, if you buy the HDDs with the longest warranties.
I absolutely agree that the average tape will last longer than the average hard drive.
I've got 30 hard drives in use right now and at least 10 are older than 5 years. A few are over 7 years old. I've also had hard drives die in less than a week.
Even if the data is on tape I want to emphasize that the tape needs to be periodically read and verify that the data is still readable and correct. Assuming the data is stable for 30 years and you can just leave it there is a dumb idea unless you didn't care about the data in the first place.
The same is true for HDDs or for any other data storage devices.
I have stored data for more than 5 years on HDDs, but fortunately I have been careful to make a duplicate for all HDDs and I did not trust the error-correction codes used by the HDDs, so all files were stored with hashes of the content, for error detection.
On the HDDs on which data had been stored for many years, it was rare to see no errors at all. Nevertheless, I have not lost any data, because the few errors never happened in the same place on both HDDs. The errors were sometimes reported by the HDDs, but at other times no errors were reported and the files were nonetheless corrupted, as detected by the content hashes and by comparison with the corresponding good file on the other HDD.
I also make checksums for every file and verify twice a year. Over millions of files totalling 450TB I end up getting about 1 failed checksum every 2 years. If you are having more frequent checksum failures I would check for RAM errors first.
zfs and btrfs would do this automatically and have built in data scrub commands.
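For reference, the scrub commands are roughly as follows (pool name and mount point are placeholders):

    # ZFS: scrub the whole pool, then check the results.
    zpool scrub tank
    zpool status tank

    # Btrfs: scrub the filesystem mounted at /data.
    btrfs scrub start /data
    btrfs scrub status /data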
I have a checksum, uh, dream for lack of a better word, but I fear I lack the talent to pull it off. The problem with a checksum is that it only tells you that an error exists (hopefully), but it does not tell you where.
Imagine writing out your stream of bytes in a m x n grid. You could then make checksums for each 1 through m columns, and 1 through n rows. This results in an additional storage of (m + n) checksum bytes. A single error is localized as an intersection of the row and column with bad checksums. One could simply iterate through the other two hundred fifty-five possibilities and correct the issue. Two errors could give two situations. The most likely is four bad checksums (two columns, two rows) and you could again iterate. The less likely is three bad checksums because the two errors are in the same row or column.
I ran the math out for the data being rewritten as a volume and a hypervolume (four dimensions). I think the hypervolume was "too much" checksum, but the three-d version looked ... doable.
Someone smarter than I has probably already done this.
Everyone here is talking about the life of either the drives or the media.
The key point ought to be that data must be regenerated within a specified period to avoid bit rot (through loss of remanence).
No matter how many backups you have if they're all made at the same time then they should all be regenerated at the same time (and within the storage safety margin period).
Policy at $Job - all important data is backed up to a rotation of high-quality hard drives, which are stored off-site, powered down. Every N weeks, each one of them is powered up (in an off-line system) and checked - both with the SMART long test, and a `zpool scrub` (which verifies ZFS's additional anti-bit-rot checksums for the data).
Yes, it's a bit of a PITA. OTOH, modern HD's are huge, so a relative few are needed. And we've lost 0 bits of our off-site data in our >25 years of using that system.
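For anyone wanting to copy the idea, one round of that check per drive can look roughly like this (device and pool names are placeholders):

    # SMART long self-test, then review the result once it completes.
    smartctl -t long /dev/sdX
    smartctl -a /dev/sdX

    # Import the backup pool on the offline box, scrub it, check the report,
    # then export it again before powering the drive down.
    zpool import backuppool
    zpool scrub backuppool
    zpool status backuppool
    zpool export backuppool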
This article is too vague. It sounds like they're talking about the physical drive not working, but they're giving examples where you can't playback because you need to install the correct old software, plugins, etc... Which doesn't have anything to do with hard drives.
So what's actually wrong with hard drives for archival? Do they deteriorate? Do they "rot" like DVDs/blurays/etc have been known to do? Or is this just an ad for their archival service?
Magnetic HDDs do suffer bit rot, yes. But perhaps more importantly the mechanisms suffer physical failure over time. You can't just pop the platters into a new drive, even if you had an identical model.
That's really the main disadvantage of hard drives: the media is permanently coupled to the drive. If your tape drive fails, you can just pop the tape into a working drive and still get your data back.
With older drives, pre-about-2010 I think, you can, as I understand it.
After that they added little NVRAM chips to the boards which hold data about the disk, so you need to make sure they match. I just fried a HDD controller with a bad SATA cable, so I'm having to switch the chip from one board to another to try to recover the data.
Oh dear, here we're talking about age deterioration. Swapping media as a solution only works if the remanence hasn't decayed below the recoverable threshold.
All magnetic media suffers bit rot by remanence decay (a natural analog phenomenon). It's just that tape by virtue of its construction and type of recording process has better data (S/N) margins (its storage longevity is better).
Hard drives are known to suffer "stiction", where the heads get stuck to the platter and either the drive won't spin up, or it spins up and the heads damage the platters. I imagine hard drives could also have bad capacitors but I haven't heard of that happening.
In the old drives of 5" HDDs the head stepper motor shaft was external, and if a drive got stuck a slight twist of the stepper shaft would unstick the heads after which the drive would spin up (well, as long as it didn't rip heads off the HDA so it was always a calculated risk).
Happened to me when I got a call out to a large UK outfit who'd had an extended power cut and knew recovery was going to be fun. First stop was a particularly critical PC which had exactly this problem, so open the case, touch the HDD just right and off it went - happy with that, and on to the next item.
Anyway, the recovery operation went well, and this particular incident came floating back by way of a hushed comment from a manager a few years later about this tech who'd come in to help with the recovery, and who'd "...laid his hands on the PC, and it came back to life!" :)
>I imagine hard drives could also have bad capacitors but I haven't heard of that happening.
That's very unlikely. If you're thinking of the "capacitor plague" of the 2000s, that only affected electrolytic capacitors, since it was caused by the Chinese poorly copying the formula for capacitor electrolyte. I don't believe hard drives used electrolytic capacitors in that time period, simply due to their size, though I could be wrong.
That seems like it could be solved by (carefully) disassembling the drive for long term storage, adding a thin piece of paper or tape under the heads, etc.
Disassembling a drive just allows dust to get in and cause more damage. Come to think of it, in recent drives the heads fully unload onto a ramp so they're probably less likely to stick.
What's the scenario where you cannot take the old 1990's hard drive and back up its data in multiple cloud service providers cold storage (Azure/AWS/GCP) and have to keep the obsolete physical media on hand?
I'm struggling to understand why these miles of shelves filled with essentially hardware junk haven't been digitized at the time when this media worked and didn't experience read issues.
The article doesn't really provide an explanation for this other than incompetence and the business biting off more than it can reasonably chew. I'd be furious if I paid for a service that promised to archive my data, and 10-15 years later told me 25% of it was unreadable. I mean it's not like it was a surprise either. These workflows became digital 2-3 decades ago. There was plenty of time to prepare and convert this.
That's kind of what I'm paying you for.
As always, seems like the simple folk of /r/datahoarder and other archivist communities are more competent than a legacy industry behemoth.
> I'd be furious if I paid for a service that promised to archive my data, and 10-15 years later told me 25% of it was unreadable.
The article is very vague on this, but I thought this company was first doing something like a bank safety deposit box. Send us your media in whatever format and we will keep it secure in a climate controlled vault. They don't offer to archive your data, they offer to store your media. Now it seems they pivoted to archiving data. This is an ad for their existing media storage clients to buy their data archive service:
> Iron Mountain would like to alert the music industry at large to the fact that, even though you may have followed recommended best practices at the time, those archived drives may now be no more easily playable than a 40-year-old reel of Ampex 456 tape.
They did make this pivot several years ago, with big upsell and a huge internal product advancement to offer housebuilt eDiscovery for lots of data types. I was at Google Cloud when they did the first big deal around this a few years ago: https://www.ironmountain.com/resources/blogs-and-articles/f/...
It's not a matter of incompetence, it's a matter of being very, very cheap.
Artistic endeavors are a unique blend of "extremely chaotic workflows nobody bothers to remember the moment the work is 'done'", "90% of our output doesn't recoup costs so we don't want to burn cash on data storage", and "that one thing you made 20 years ago is now an indie darling and we want to remaster it". A lot of creatives and publishers were sold on the promise of digital 30-odd years ago. They recorded their masters - their "source code" - onto formats they believed would be still in use today. Then they paid Iron Mountain to store that media.
Iron Mountain is a safe deposit box on steroids, they use underground vaults to store physical media. You store media in Iron Mountain if you want that specific media to remain safe in any circumstance[0], but that's a strategy that doesn't make sense for electronic media. There is no electronic format that is shelf-stable and guaranteed to be economically readable 30 years out.
What you already know works is periodic remigration and verification[1], but that's an active measure that costs money to do. Publishers don't want to pay that cost, it breaks their business model, 90% of what they make will never be profitable. So now they're paying Iron Mountain even more for data recovery on the small fraction of data they care about. The key thing to remember is that they don't know what they need to recover at the time the data is being stored. If they did, publishers wouldn't be spending money on risky projects, they'd have a scientific formula to create a perfect movie or album or TV show that would recoup costs all the time.
[0] The original sales pitch being that these vaults were nuke-proof.
[1] Your cloud provider does this automatically and that's built into the monthly fees you would pay. People who are DIYing their storage setup and using BTRFS or ZFS are using filesystems that automate that for online disks, but you still pay for keeping the disks online.
It depends on what specifically Iron Mountain is selling you. A place to store your physical data device or are they promising to keep your data available? The former sounds cheaper and easier for Iron Mountain. Given Iron Mountain started in the 1950s, redundantly backing up customer data was infeasibly expensive for most of the company’s lifetime.
> the business biting off more than it can reasonably chew
It's hoarding behavior. They paid "a lot" of money for it, have no idea how to further exploit it, but can't shake the feeling that it might be massively valuable one day.
The only difference is they pay someone to hold their hoard for them.
Alternatively ... they are forced to maintain it for compliance reasons, especially when it comes to healthcare, finance and other regulated industries (defense, in particular). This even applies to manufacturing companies assembling medical devices and defense products: all the data about the supply chain, the engineering designs & changes, the manufacturing and quality testing, and shipment needs to be kept for XX years and is subject to both audit by regulatory agencies and to legal discovery.
Those are all things which can be printed out and stored in alternative forms and possibly even recreated from other data. It's also the case that much of that data will never be permanently at rest and so several archive copies of the data exist.
Recordings of performances are an entirely different category of artefact.
Iron mountain also provides services like source code escrow.
With 2 parties involved in the data, you may want to impose additional restrictions regarding how and when it can be replicated. The party requesting escrow clearly has interest in the source being as durable as possible, but the party providing the source may not want it to be made available across an array of dropbox-style online/networked systems just to accommodate an unlikely black swan event.
A compromise could be to require that the source reside on the original backup media with multiple copies and media types available.
Also the cases where you don't render the tracks pre- and post-processing and keep them alongside the ProTools project files. I don't know who expects to open a ProTools project with a bunch of unknown plugins after some years have passed...
Back in the 2000s the Australian government provided software that ran on Windows to prepare and submit your personal tax return. I used to archive my tax return by preparing it in a Windows VM, then storing the whole VM image.
I mean, even if by contract they were supposed to store physical media with the backups, it is still horrible incompetence to not have the same data backed up twice, and from time to time test the disks for failure to rebuild the backup from one of the copies.
It would be extremely unlikely for both disks to fail together.
What I'm describing is the bare minimum. This is their job, by all accounts. Amazing.
Makes me wish we hadn't stopped advancing optical media technology before we got cheap, reliable, archival-quality 1TB discs for a few bucks each. I guess LTO is the best option for personally controlled archival.
We haven’t, but sadly the technology is locked to big tech.
Microsoft has demoed some cool technology where they store data in glass, Project Silica. Sadly, it seems unlikely this will ever be available to consumers. One neat aspect of the design is that writing data is significantly higher power than reading. So you can keep your writing devices physically separated from the readers and have no fear that malicious code could ever overwrite existing data plates.
Some blurbs
Project Silica is developing the world’s first storage technology designed and built from the media up to address humanity’s need for a long-term, sustainable storage technology. We store data in quartz glass: a low-cost, durable WORM media that is EMF-proof, and offers lifetimes of tens to hundreds of thousands of years. This has huge consequences for sustainability, as it means we can leave data in situ, and eliminate the costly cycle of periodically copying data to a new media generation.
We’re re-thinking how large-scale storage systems are built in order to fully exploit the properties of the glass media and create a sustainable and secure storage system to support archival storage for decades to come! We are co-designing the hardware and software stacks from scratch, from the media all the way up to the cloud user API. This includes a novel, low-power design for the media library that challenges what the robotics and mechanics of archival storage systems look like.
Why would they sell it directly? Works better if they can advertise their one of a kind, super stable, cloud specific data archival solution that nobody else can replicate. Or not even advertise it, but maintain lower storage costs per byte relative to AWS or Google.
As far as I know, the technology behind Amazon Glacier has never been shared. Glass disks could eventually be backing the Microsoft equivalent.
I doubt the decisions on the product came down along that logic.
Surely they could make more money by selling it in some form or another. If the economics actually gave them a storage cost advantage over AWS/GCP, then profitability must be possible.
In reality it's probably incredibly expensive, and the ROI could not be obtained without even further investment to drive the costs down.
>Why would they sell it directly? Works better if they can advertise their one of a kind, super stable, cloud specific data archival solution that nobody else can replicate.
Because network speeds aren't high enough to back up terabytes of data remotely on a regular basis. This would only work if you already store all your data with this vendor, which is probably a stupid move.
If network speeds aren't enough, there's Azure Data box, which is the equivalent to AWS Snowball, where they mail you a hard drive and you ship it back to them and they put it in their cloud.
Optical media is neat, but has a number of drawbacks when it comes to large scale operations.
What you're talking about already sort of exists, although the media never reached "cheap" - the manufacturing scale wasn't there. People weren't interested enough in it. Archival Disc was a standard that Sony and Panasonic produced, https://en.wikipedia.org/wiki/Archival_Disc. Before the standard was retired you could buy gen3 ones with 5.5TB of capacity, https://pro.sony/ue_US/products/optical-disc-archive-cartrid...
LTO tape was already at 15TB by the time their 300GB Discs came out, and reached 45TB capacity 3 years ago. Tape is still leaps and bounds ahead of anything achievable in optical media and isn't write-once. (https://en.wikipedia.org/wiki/Linear_Tape-Open)
Part of the problem is you can't just store and forget, you have to carry out fixity checks on a regular basis (https://blogs.loc.gov/thesignal/2014/02/check-yourself-how-a...). Same thing as with your backups, backups that don't have restores tested aren't really backups, they're just bitrot. You want to know that when you go to get something archived, it's actually there. That means you're having to load and validate every bit of media on a very regular basis, because you have to catch degradation before it's an issue. That's probably fine when you're talking a handful of discs, but it doesn't scale that well at all.
The amount of space that it takes for the drives to read the optical disc, the machinery to handle the physical automation of shuffling discs around etc. combined with the costs of it, just make no sense compared to the pre-existing solutions in the space.
You don't get the effective data density (GB/sq meter) you'd need to make it make sense, nor do the drives come at any kind of a price point that could possibly overcome those costs.
To top it all off, the storage environment conditions of optical media isn't really any different from Tape, except maybe slightly less sensitive to magnetic interference.
>LTO tape was already at 15TB by the time their 300GB Discs came out, and reached 45TB capacity 3 years ago.
No, they didn't. The largest LTO tape is only 18TB; your numbers are bogus. Those are BS advertised numbers with compression. If you're storing a bunch of movies or photos, for instance, you can't compress that data any further. The actual amount of data that the medium can physically store is the only useful number when discussing data storage media.
That's fair. LTO capacity was still significantly larger than Archival Disc's at any stage in Archival Disc's life cycle.
Both Sony and Panasonic completely failed to demonstrate actual value from the format. Smaller capacity, for the same kinds of environmental constraints, similar size drives etc. There was just no reason to actually use it.
Yeah, it's really too bad someone hasn't made a reasonably-priced archival format that consumers and small businesses can use, because LTO isn't it. The closest they have is MDISC, but the storage capacity is small, and from what I'm reading, discs advertised with this aren't necessarily all that long-lived anyway (if they're using dye).
What we need is a cheap, write-once format that can hold at least 1TB, similar to how we used to use CD-Rs 20-25 years ago, but without organic dye like those discs and with a far longer shelf life.
I wonder if someone could mass-produce a BD-R type media, but the size of a laser-disc, and resistant to almost all scratching. Maybe put it in a case like the old 3.5" floppy disks had?
It's also quite expensive, at $6/TB/month. If you have a lot of data, that adds up quickly. Just for my 4TB backup drive, it's much cheaper to just use HDDs and rotate them myself.
Unfortunately, recordable optical is on its way out. Sony recently slashed the staff at the Japan plant that makes BD-Rs (BD-R XLs). CMC still makes CD-R, DVD-R and BD-R media, though.
According to their video ( https://vimeo.com/502475794/ffbfb82b15 ), the company patented bit plane image storage. What the heck? That is so obvious and shouldn't be patentable.
On a side note, they keep touting how robust their data archival solution is. But I have my doubts. For example, if an image has a big patch of 0 or 1 bits, then it might be impossible to accurately align the bit positions ("reclocking"); this is the same issue with QR codes and why they have a masking (scrambling) technique. Another problem is that their format doesn't seem to mention error correction codes; adding Reed-Solomon ECC is an essential technique in many, many popular formats already.
> It may sound like a sales pitch, but it’s not; it’s a call for action
Your entire article sounds like a sales pitch. Your solution is, well, it's bad, but trust us, we can maybe recover it anyways. Otherwise your article fails to convey anything meaningful.
Tape has its own problems. LTO drives only have 2 generations of backwards compatibility[0] and nobody makes new drives for old formats. So if you have a whole library of tapes, you'll need to copy them over to newer formats periodically just to retain access to them, which is expensive.
And once you start doing that, you've just quashed THE advantage tape had over disk. LTO doesn't provide any more reliability, it just shifts the failure points around. Instead of 20 year old sealed hard drives with bearings that will seize up and render your data unreadable, it'll be perfectly stable 20 year old tapes that no drive in the world can read. I'm also skeptical of the cost savings from cheap media once periodic remigrations are priced in, but it might still win out over disk for absolutely enormous libraries (e.g. entire Hollywood studio productions).
And no, there isn't some other tape format that has better long-term support. Oracle stopped upgrading T10000 around 2017, and IBM 3592 has an even worse backwards compatibility story than LTO.
[0] LTO-8 drives only have 1 generation of backwards compatibility because TMR heads get trashed by metal particulate tapes
Tape drives aren't economical. If you're a big company, sure, they make sense, but for individuals and small companies, they really don't. Hard drives are absolutely the only realistic and affordable way to back up data. They're not bulletproof though, so you need multiple hard drives, and you need a backup strategy that rotates them, so even if one fails, you haven't lost too much.
As someone who only needs to backup 10TB, I looked once at getting into tape - the cost of getting a tape drive and a way to connect it to my computer was eye-watering. The very long-term prospects were even worse: I’d have to choose between buying multiple LTO-n tape drives for redundancy, or keep upgrading every 7ish years or so, which entails buying more tape drives.
I’ll stick with my 10 2TB hard drives, zfs, and biannual swaps to new hard drives, sadly. At least I won’t ever have to deal with more than 10 hard drives at time, assuming $/GB never increases.
At some point I should start sticking a Windows hard disk image on there; .vhd will still probably be readable and bootable in a VM 10 years from now.
I think it is, or at least tries to be, more than that: not only stop archiving to HDs, but also understand the dependency requirements, which is a whole secondary problem.
The last time I used a tape it broke right off the reel the first time I used it. Good name brand tape drive, good name brand tape. I felt pretty burned by that experience after the money spent and the result.
We’re about to start a project to build an LTO-9 based in-house backup system. Any suggestions for DIY Linux based operation doing it “correctly” would be appreciated. Preliminary planning is to have one drive system on in our primary data center and another offsite at an office center where tapes are verified before storage in locked fireproof storage cabinet. Tips on good small business suppliers and gear models would be great help.
If you are having trouble getting 10TB of disk space from IT, you have bad IT. Not saying that's uncommon or anything, but 10TB fits on one external hard drive for $300 from Best Buy, or less than $1000/mo using EBS on AWS if you need some better guarantees and are all-in on the cloud.
> Tapes are fun. You can fit a petabyte of data in a bankers box!
Yes, though those ultra-thin M.2 NVMe drives could probably top that now.
> Do we have a drive that can read this tape?
Don't let this be a problem in the first place: buy 4 tape drives and keep 2 of them in your cold/offsite/airgapped storage site (2 in case 1 fails, so you can use the remaining drive to transfer everything to a newer format).
> Do we have server we can connect it to?
Significant hardware should not be (and is not) necessary: the LTO-7/8/9 drives I see for sale right now seem to be using either USB 3.0, Thunderbolt, or SAS connections; USB and Thunderbolt can be handled by any computer you can find at a PC recycler today, while any old desktop can handle SAS with an $80 HBA card.
> Do we have storage we can extract it to? (go ask your internal IT team for 10TB of drivespace...)
10TB isn't a good example (case-in-point: I have a 3-year-old stack of unused 12TB WD drives less than 3 feet away from me).
That said, if you're enterprise-y enough for a $4000 LTO-9 drive, then you probably also have a SAN that's chock full of drives, so being able to provision a 10TB+ LUN should be implicit.
> What program did we create this tape with? Backup Exec, Veritas, ArcServe, SureStore
Ideally, none of those; instead, good ol' Perl and `dd`.
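i.e. something as plain as the following (device path and block size are just examples):

    # Write a directory tree to tape as a plain tar stream.
    tar cf - /data/project | dd of=/dev/nst0 bs=256k

    # List it back later, from any box with a compatible drive.
    dd if=/dev/nst0 bs=256k | tar tvf -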
> You have the encryption keys, right?
I don't encrypt my backups to avoid this problem. My old data archives have little exploitable value for any potential attacker; and I imagine I'd store backup tapes this important in a fireproof safe in my parents' house or something. I'd only encrypt the entire tape if the tape were to leave my custody.
I appreciate that this is not for everyone, and it's probably illegal for some people/orgs to not encrypt backups too anyway (HIPAA, etc).
> How much of this data already exists on the previous months backup?
Incremental/Differential backups are still a thing.
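e.g. GNU tar's listed-incremental mode, which only needs the snapshot file kept around (paths are examples):

    # Full (level-0) backup; the .snar snapshot records what was written.
    tar --listed-incremental=/backup/data.snar -cf /backup/full.tar /data

    # Next run: only files changed since the snapshot get written.
    tar --listed-incremental=/backup/data.snar -cf /backup/incr-1.tar /data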
> Who's going to pay for the storage to move it to Glacier/etc?
No-one should. Cold data backups should/must always be in the custody of a designated responsible officer of the company.
BUUUT, I guess there's nothing wrong with storing an encrypted copy in Cloud storage (as in S3/Glacier/AzureBlobs - not OneDrive...). I actually do this right now thanks to the smooth and painless integration in my Synology NAS. It costs me about $15/mo to store all these TBs in S3.
> How long is it going to take to upload?
Consider that it's 2024 - a company with LTO-9 and SAN is probably going to have a metro-ethernet IP connection at 10Gbps or even faster. At home I have a 10Gbps symmetric connection from Ziply (it's $300/mo and they give you an SFP module, which I put into my Ubiquiti UDM): so the limiting factor here is not my upload speed, but my drive read speed (LTO-9 drives seem to read at about 2-3Gbps raw/uncompressed?)
Make sure the bandwidth exists to keep up with the write speed of the LTO drive. For instance, the write speed for LTO-6 (which I own as a hobbyist) is around 300MB/s, but line speed of gigabit Ethernet is about 100MB/s. Translate those numbers to LTO-9 and make sure that the NAS, network, or local storage can keep up. It's not a deal-breaker to underflow the drive, but it causes the tape to stop, rewind, and re-buffer (called shoe-shining) which takes more time and causes unnecessary wear on the drive and cartridges.
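A common way to avoid underflowing the drive is to put a large RAM buffer in front of it, e.g. with mbuffer (sizes are illustrative):

    # Buffer 4 GB in RAM and only start streaming to the tape once the buffer
    # is 90% full, so the drive keeps writing at full speed instead of shoe-shining.
    tar cf - /data | mbuffer -m 4G -P 90 -s 256k -o /dev/nst0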
Nothing is fire proof. Is the cabinet "fire suppression system liquid" proof?
> Tips on good small business suppliers and gear models would be great help.
Hire an auditor would be my advice. Every business is different.
I am, just now, having flashbacks of when I was in a SOX environment and had to regularly contract with them... and while the experience can be somewhat unpleasant I've often found good auditors to be extremely knowledgeable about solutions and their practical implementation considerations.
They're not fully sealed. There are two shafts which connect the King's chamber to the exterior of the pyramid. The lower ritual congregation area is not fully sealed off from the upper chambers either. Which means bats are a constant problem in pyramids.
Archival is more of a process than only a question of media. First you must create a proper database of your archived data.
Maybe you want to do 3 copies, not two. Maybe you want to use two different archival formats such as tar and LTFS, just in case. Maybe you want to source your media from both available producers (Sony and Fuji) because in the long run, maybe one or the other may grow some funky error mode or corruption problem. Etc.
LTO-9 tapes can be easily found on Amazon in many countries, made by IBM, HP, Quantum or Fuji.
The vendor does not matter, whichever happens to be cheaper at the moment is fine.
For the tape drives, the internal models can be around 10% cheaper, but I prefer the tabletop drives, because they are less prone to accumulating dust, especially if you switch them on only when doing a backup or a retrieval. Tape drives usually have very noisy fans, because they are expected to be used in isolated server rooms.
I believe that the cheapest tape drives from a reputable manufacturer are those from Quantum. I have been using a Quantum LTO-7 tape drive for about 7 or 8 years and I have been content with it. Looking now at the prices, it should be possible to find a tabletop LTO-9 drive for no more than $5000. Unfortunately, the prices for tape drives have been increasing; when I bought an LTO-7 tabletop drive many years ago, it was only slightly more than $3000.
The tapes are much cheaper and much more reliable than hard disks, but because of the very expensive tape drive you need to store a few hundred TB to begin to save money over hard disks. You should normally make at least two copies of any tape that is intended for long-term archiving (to be stored in different places), which will shorten the time until reaching the threshold of breaking even with HDDs.
Although there are applications that simulate a file system on a tape, which even a naive user can use to just copy files to a tape as if copying between disks, they are quite slow and inefficient in comparison to using raw tape commands with the traditional UNIX utility "mt".
It is possible to write some very simple scripts that use "mt" and which allow the appending of a number of files to a tape or the reading of a number of consecutive files from a tape, starting from the nth file since the beginning of a tape. So if you are using only raw "mt" commands, you can identify the archived files only by their ordinal number since the beginning of the tape.
This is enough for me, because I prepare the files for backup by copying them in some directory, making an index of that directory, then compressing it and encrypting it. I send to the tape only encrypted and compressed archive files, so I disable the internal compression of the tape drive, which would be useless.
I store the information about the content of the archives stored on tapes (which includes all relevant file metadata for each file contained in the compressed archives, including file name, path name, file length, modification time, a hash of the file content) in a database. Whenever I need archived data, I search the database, to determine that it can be found, for instance in tape 63, file 102. Then I can insert the corresponding cartridge in the drive and I give the command to retrieve file 102.
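The retrieval side of that is basically just the following (Linux-style /dev/nst0 shown; FreeBSD uses /dev/nsa0 and its own mt; the output file name is a placeholder):

    # Fetch archive number 102 from the currently loaded tape.
    mt -f /dev/nst0 rewind          # back to the beginning of the tape
    mt -f /dev/nst0 fsf 101         # skip over the first 101 files (file marks)
    dd if=/dev/nst0 bs=256k of=archive_102.bin    # read file 102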
I consider the "mt" utility of FreeBSD much better than that of Linux. The Linux magnetic tape utilities have seen little maintenance for many years.
Because of that, when I make backups or retrievals they go to a server that runs FreeBSD, on which the SAS HBA card is installed. When a tabletop drive is used, the SAS HBA card must have external SAS connectors, to allow the use of an appropriate cable. I actually reboot that server into FreeBSD for doing backups or retrievals, which is easy because I boot it from Ethernet with PXE, so I can select remotely what OS to be booted. One could also use a FreeBSD VM on a Linux server, with pass-through of the SAS HBA card, but I have not tried to do this.
My servers are connected with 10 Gb/s Ethernet links, which does not differ much from the SAS speed, so they do not slow down the backup/retrieval much. I transfer the archive files with rsync over ssh. On slow computers and internal networks one can use rsync without ssh. I give the commands for the tape drive from the computer that is backed up, as one-line commands executed remotely by ssh.
The archive that is transferred is stored in a RAMdisk before being written to the tape, to ensure that the tape is written at the maximum speed. I write archive files of usually up to about 60 GB to the tape (I split any files bigger than that; e.g. there are BluRay movies of up to 100 GB). The server has 128 GB of memory, so I can configure a RAMdisk of up to 80 GB on it without problems. This method can be used even with a slow 1 Gb/s or 2.5 Gb/s network, but then uploading a file over Ethernet takes much more time than writing or reading the tape.
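Setting up such a staging RAMdisk is a one-liner on either OS (size and mount point are examples):

    # Linux: an 80 GB tmpfs staging area.
    mount -t tmpfs -o size=80g tmpfs /mnt/staging

    # FreeBSD: a memory-backed filesystem of the same size.
    mdmfs -s 80g md /mnt/staging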
There is one weird feature of the raw "mt" commands, which is poorly documented, so it took me some time to discover it, during which I have wasted some tape space.
When you append files to a partially written tape, you first give a command to go to the end of the written part of the tape. However, you must not start writing yet, because the head is not positioned correctly. You must go 2 file marks backwards, then 1 file mark forwards. Only then is the head positioned correctly and you can write the next archived file. Otherwise an empty file gets intercalated at each point where you finished appending a batch of files, rewound the tape, and later appended more files at the end.
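In command form, the append sequence described above is roughly (Linux mt shown; behaviour may differ slightly between drivers and drives):

    mt -f /dev/nst0 eod      # go to the end of the recorded data
    mt -f /dev/nst0 bsf 2    # back up over two file marks
    mt -f /dev/nst0 fsf 1    # forward over one, so the head sits just after the last file
    dd if=next_archive.bin of=/dev/nst0 bs=256k   # append the next archive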
A lot of very interesting details in your reply - thanks. I have this question:
If you aren’t budget constrained today and had to set it all up again. What would you do?
While I’m a Linux guy, I’ll happily run BSDs when appropriate, like for pfSense, and if FreeBSD really has better mt tools or drivers for LTO-9 drives due to the culture/contributors being more old school, then I’d just grab a 1U server to dedicate to it, run a BSD, and attach the drive to that.
You seem to have extensive practical hands on experience and while I was doing tapes 20 years ago this will be first time I’m hands on again with it since then. So I need to research most reliable drive vendors and state of kernel drivers and tools, just as you are alluding to.
Pretend you have $50K if needed (doubt it). 2PB of existing data, 1PB/year targeted rate, probably 10-20%/year acceleration on that rate. We have a data center rack location with a 20Gb/s interconnect via bonded 10Gb NICs to the storage servers (45drives Storinators), plus an office center cabinet/rack/desk (your choice). We will put a tape drive holding at least 8 tapes in the data center, planning for a worst case of 100TB a month, so data center visits to swap in new tapes shouldn't be too frequent. Any details on what you would do would be interesting.
Like I have said, it is not necessary to dedicate a full-time FreeBSD server for this, you can use either a Linux server that is rebooted temporarily in FreeBSD or a FreeBSD virtual machine on the Linux server.
Around $5000 to $6000 should be enough for a LTO-9 tabletop tape drive plus a suitable SAS HBA card and SAS cable. The card must have matching SAS connectors and SAS speed with the tape drive.
More money will not bring anything extra until a much higher amount is reached, which would be enough to buy a tape autoloader/library, which would eliminate the necessity for a human to insert and remove the cartridges into the tape drive when needed. I am not sure if $50K is enough for a tape autoloader.
Tape autoloaders/libraries are worthwhile only for very big organizations where the amount of data that is continuously written or read to or from the tapes is very large. For a small business or for an individual a tape autoloader is certainly not worthwhile, because the tape drive will be in use at most a small fraction of every day.
1 PB/year is less than 3 TB/day. This can be written on a single tape in a little more than 2 hours. Even with a simple non-pipelined implementation of the file uploading with the writing on the tape, the backup can be done in less than 4 hours. Even writing 2 copies can be done in less than 8 hours. The backup can be done mostly or completely overnight.
For a much bigger amount of data one could buy several tape drives, before starting to think about an autoloader. Also it is possible to pipeline the network transfers with the tape writing, for a backup speed higher by around 50%.
If money were not a problem and the data needs to be archived for the long term, so that multiple copies are desirable, I would buy 2 tape drives, to be able to write 2 copies simultaneously.
This would also halve the time for archiving the initial 2 PB of existing data, which will take several months, so a speed-up would be desirable. Having 2 drives will also increase the reliability, as the system will continue to work if one becomes defective.
With only 3 TB written per day, a LTO-9 tape, which has a capacity of 18 TB, will be enough for 6 days.
So unless a backup must be restored, the operator would need to change the tape only once per week.
This is a moderate amount of data, easy to handle with a single drive, even if two are preferable for redundancy and for higher speed.
I do not understand your reference to "a tape drive holding at least 8 tapes in data center". If you mean an autoloader, from what you describe it does not seem that the very big expense for an autoloader would be justified.
The LTO tapes are best stored in suitcases that can contain 20 cartridges, i.e. when using LTO-9 that is 360 TB. Therefore 3 suitcases store more than 1 PB, i.e. a year of data according to your example. The suitcases should be stored in a secure safe or cabinet. They are usually made to be stackable.
I have assumed that your 1 PB is of already compressed data. If the data is compressible, then the requirements for the usage time of the drives and for the storage volume would be much smaller.
I have forgotten to mention that after I compress and encrypt the archived files, I add redundancy with a Reed-Solomon code, e.g. with the par2 program. If I choose e.g. a redundancy of 5%, then a file retrieved from the magnetic tape could have defects of up to 5% of its size, while the original data could still be extracted from it.
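As an illustration, a par2 invocation along the following lines adds roughly 5% recovery data to an archive file before it goes to tape; the file name is a placeholder:

    # create ~5% Reed-Solomon recovery data in a single recovery file
    par2 create -r5 -n1 "$archive_file"
    # after reading the file back from the tape, verify it and repair if needed
    par2 verify "$archive_file".par2
    par2 repair "$archive_file".par2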
Excellent help. To clarify a few items:
- yes I mean drives with autoloader. for example: https://www.backupworks.com/qualstar-Q24-LTO-9-SAS-Library.a...
It’s basically a hard requirement, as we don’t have staff time to enter data centers frequently. We are a bit unusual in being certainly not big, but not really a small business either when looking at the budgets available. Unless there is something wrong with the Qualstar product linked above, perhaps autoloaders are cheaper than you believed?
- Understood your rebooting trick. However, being fully automated (apart from blank tape rotations) is also a requirement; it’s production infrastructure. If FreeBSD provides significant value, it seems safer to spec a dedicated 1U server to use for backups. There is a management node currently that might work, though it has to keep running Linux as it currently does, and I need to check whether its SAS can be used. It has a bunch of SAS SSD drives currently, and I would have assumed there is a way to cable up the Qualstar drive … but again, I’m still early in researching. The SAS compatibility issue you raise is a perfect example of the stuff I need to figure out.
- Love par2cmdline; our burner with M-DISC for IP backup uses that on git repo files, and then SeqBox as an outer container for the data to guard against potential filesystem metadata corruption issues. There was a newer low-level tool (a Rust rewrite, I think) with many bitrot protection features whose name I can’t recall currently and which isn’t immediately coming up in my notes, but I know it exists and have been meaning to look into it. It has a newer erasure encoding like RaptorQ and also block metadata like SeqBox; I think it could replace the par2 + SeqBox combo we are currently using on the M-DISC physical backup for IP. I don’t trust 100% cloud, as one can imagine somehow getting all accounts hacked and deleted.
- Yes on compressed. The 2PB is already very highly compressed, so it means 18TB/tape.
Do you have any vendors/distributors you can recommend? I always recommend 45drives to people, and I was planning to ask them about LTO when we order the next Storinator, which is also coming up soon.
There is this interesting blog post from a couple of years ago that probably was the seed of my plan to embark on LTO. Our monthly Backblaze invoice is totally out of control. But we need a full backup of our data, as it’s simply not replaceable and is at the heart of the business.
If you would use the full configuration with 2 tape drives, the cost of the system might be around $15k, which is very reasonable for a tape library with autoloader.
I think that this autoloader is a good choice, especially if the price includes "1 x IBM LTO-9 SAS Tape Drive Installed".
As I have said, I believe that it is better to choose the option of also including the second tape drive.
For the tapes, there is no reason to worry about specific distributors. I have always bought them from Amazon, but shops that specialize in storage products should be OK, unless they charge a premium over what can be found at Amazon or Newegg. While the tapes are made by Fuji or Sony, they are usually easier to find and cheaper when sold as IBM, HP or Quantum branded tapes.
The prices vary, so whichever vendor is cheaper when you buy a batch of tapes should be fine. An LTO-9 cartridge should be only slightly over $100. In time the prices of LTO-9 cartridges should drop. For now they are more expensive than the older cartridges, because they are still relatively new.
You must check the tape drive's requirements for the SAS HBA PCIe card that will be installed in the server, which must have compatible connectors, and you must buy an appropriate SAS cable. I believe that the LTO-9 drives require the newer 12 Gb/s SAS standard and also the newer variant of the external SAS connectors (perhaps SAS HD SFF-8644 connectors).
If you already have a 12 Gb/s SAS HBA that has only internal connectors for SSDs, it is possible to reuse it by buying a SAS internal-to-external adapter with the appropriate connector types. The adapter occupies one of the empty expansion slots of the server case, plugs into the internal connectors and provides external connectors. Such adapters can also be used with server motherboards that have on-board SAS controllers. If you have a SAS HBA card with external connectors that differ from those on the tape drive, e.g. SAS SFF-8088, there are cables with mixed SAS connectors that can connect the tape drives. The HBA cards usually have at least 2 external SAS connectors, suitable for 2 tape drives.
With the autoloader, it should be easy to make the backup or retrieval process completely automatic, so that an operator should not have to visit the tape autoloader more often than once every few months. The exception is the initial phase, when you would have to write 2 PB on almost 120 tapes (or double that number for improved redundancy, beyond the redundancy added per archive file; 2 copies can be stored in 2 different geographic locations, to avoid the catastrophic loss of all tapes), so you would want to keep the tape autoloader in an easily accessible place during that time.
The initial cost for writing 2 copies of 2 PB of data, i.e. 4 PB of data, would be not much less than $30k for the tapes. This, together with the autoloader with 2 tape drives, HBA card, cases, cables and maybe adapters, would be in the range of $45k to $50k, so within your estimated budget.
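The tape count behind that estimate, taking something like $130 per LTO-9 cartridge as a working assumption (the actual street price may be somewhat lower, as noted above):

    echo '4 * 10^15 / (18 * 10^12)' | bc    # ~222 LTO-9 cartridges for 4 PB
    echo '222 * 130' | bc                   # ~$28,860 for the tapes alone at ~$130 each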
As I have said, it is convenient to have a database with the metadata (including content hashes, made e.g. with BLAKE2b-512 or with BLAKE3-256) of all the files that have ever been archived. This database is used whenever information must be retrieved, and it can also be used for deduplication (for which the content hashes are handy), i.e. to check whether a file is already present in some earlier archive, so that it does not need to be backed up again.
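A minimal sketch of what such a catalog could look like, assuming GNU coreutils' b2sum for the BLAKE2b hashes and sqlite3 for the database; the schema and variable names are purely illustrative:

    # one-time schema: where each archived file ended up
    sqlite3 catalog.db 'CREATE TABLE IF NOT EXISTS archived
        (path TEXT, bytes INTEGER, b2 TEXT, tape TEXT, file_no INTEGER);'

    # when archiving a file, record its hash and its location on tape
    h=$(b2sum "$f" | cut -d" " -f1)
    sqlite3 catalog.db "INSERT INTO archived VALUES
        ('$f', $(wc -c < "$f"), '$h', '$tape_label', $file_no);"

    # deduplication check before writing a new file to tape
    sqlite3 catalog.db "SELECT tape, file_no FROM archived WHERE b2 = '$h';"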
I want to add that when you start testing the tape drives, one of the first things that you need to do is to measure the exact capacity of an 18 TB LTO-9 tape cartridge.
For instance, I write the tapes with "dd bs=131072 if="$file_name" of=/dev/nsa0". This means that I am using 128 kB blocks. I have measured that a 6 TB LTO-7 tape cartridge has a capacity of 45905860 such 128 kB blocks.
The position of the read/write head, measured in blocks from the beginning of the tape, can be obtained with "mt rdspos". After you choose some block size, e.g. 128 kB, you should forever stick with it in all your write commands and on all your tapes, so that you will always get consistent information about the position of the read/write head.
The tape capacity can be measured by writing files, preferably of the same size that you will typically use for archives (in order to write a similar number of file marks), until you get a write error.
With the capacity of the tape known exactly, after any writing of a new file you get the current position and you compute the remaining free space on the tape, to know whether you can still append data or you must change the tape.
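In script form that bookkeeping might look something like this; the parsing of the rdspos output is an assumption and must be adapted to whatever your mt actually prints:

    # capacity measured once per media generation, in 128 kB blocks
    # (this is the 6 TB LTO-7 figure from above; measure your own LTO-9 value)
    tape_blocks=45905860
    # current head position in blocks; adjust the parsing to your mt output format
    pos=$(mt -f /dev/nsa0 rdspos | awk '{print $NF}')
    free_blocks=$((tape_blocks - pos))
    echo "~$((free_blocks * 131072 / 1000000000)) GB still free on this tape"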
The position in blocks can also be used to verify that the tape drive works OK. For example when after rewinding the tape you go to the end of the written part, to append new files, you must see the same position as after your last write. Or when writing a copy of a tape, you must see the same positions on both tapes for any file.
For retrieving files, the position in blocks does not matter, but only the ordinal number of a file. You position the read/write head to the beginning of a file with "mt rewind; mt fsf $file_number". Then you read the file, possibly in a loop if you want to read multiple consecutive files.
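A sketch of such a retrieval loop under those conventions; note that depending on the driver you may need an explicit file-mark skip between reads:

    # restore archive files $first through $last from the tape
    mt -f /dev/nsa0 rewind
    mt -f /dev/nsa0 fsf "$first"
    for i in $(seq "$first" "$last"); do
        # each dd stops at the next file mark; some drivers need an extra
        # "mt -f /dev/nsa0 fsf" here to step over the mark before the next read
        dd bs=131072 if=/dev/nsa0 of="restored_file_$i"
    done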
For going to the end of the written part of a tape, to append new files, you must use "mt locate -e; mt bsf 2; mt fsf", as I have mentioned in a previous posting. The explanation of why this is needed is buried in the documentation about how tape marks and head positioning really work.
Whenever I start using the tape drive, I use "mt comp off; mt status" and I check the status output to be as expected.
The tape is ejected with "mt -f /dev/esa0 rewind".
At the currently advertised reduced price of $7226, the Quantum SuperLoader 3 would be a good choice.
I would buy 2 of them, which together with all the other items and with 4 PB of tapes for the migration of the existing data would not exceed your estimated budget of $50k.
I assume that for this price you might get the 8-slot version. Quantum SuperLoader 3 can be extended to 16 slots, but I assume that for this you must buy an additional 8-cartridge removable active magazine. You should check the price for that.
Because I had good experience with the reliability of my Quantum tabletop tape drive, I would recommend this Quantum autoloader. Moreover, its datasheet includes all the expected information about reliability parameters, so they are tested by the manufacturer.
I consider the included backup software useless. You should write your own backup scripts. You might need a few days for this, depending on your previous experience and on the support provided by the utilities specific to the file systems that you happen to use, but then you can be worry-free for years, unlike when you depend for all your precious data on a black-box proprietary program, which cannot be trusted to do the right thing and which might write data in a format that cannot be recovered with any other tool (without extensive reverse-engineering work).
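As a rough idea of the size of such a script, here is a heavily simplified skeleton of one nightly run; the compression and encryption tools (zstd, gpg), the paths and the options are only placeholders for whatever you actually use, and error handling, the catalog update and the second copy are left out:

    #!/bin/sh
    set -eu
    src="/data/incoming/$(date +%Y-%m-%d)"
    arch="/staging/backup-$(date +%Y-%m-%d).tar.zst.gpg"

    # pack, compress and encrypt (placeholder tool choices)
    tar cf - "$src" | zstd -T0 | gpg --encrypt -r backup@example.org -o "$arch"
    # add ~5% recovery data
    par2 create -r5 -n1 "$arch"

    # position the head at the end of the written part of the tape
    mt -f /dev/nsa0 locate -e
    mt -f /dev/nsa0 bsf 2
    mt -f /dev/nsa0 fsf

    # append the archive and its recovery files, always with the same block size
    for f in "$arch" "$arch".*par2; do
        dd bs=131072 if="$f" of=/dev/nsa0
    done
    # record the new end-of-data position for the catalog
    mt -f /dev/nsa0 rdspos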
Regarding Linux's "mt", there are two versions: the horrible, primitive one that comes with cpio, which is almost certainly the one installed by default, and "mt-st", the actually usable one.
Great post. You might be able to replace the RAM disk with the "mbuffer" command. My script uses a combination of dd | pv | mbuffer | mt. I omitted the options because I don't remember any of them. I personally use dd of an ext4 filesystem-on-file that is exactly the size of what will fit on a tape. This was simply because I couldn't figure out how to reliably advance the tape head or how to continue a write from one tape to another.
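For reference, a pipeline of that shape could look roughly like the following; the buffer size, fill threshold and device name are guesses to be tuned, not recommendations:

    # stream a tar archive to tape through a RAM buffer so the drive does not stall
    tar cf - /data/to/backup | pv | mbuffer -m 8G -P 80 -s 131072 -o /dev/nst0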
The advice I got long ago from an IT guy was: if you wait long enough, tape will be on top again.
That was a long time ago but I’ve peeked in at backup systems in the intervening years and it does seem to hold true over time.
But it really depends how much data you have. My ex dropped a single HDD in a safety deposit box at CoB, N times per week and fetched back the oldest disk. I don’t think she ever said how many were in there but I doubt it was more than three. I think the CTO took one home with him once per week.
The silly thing about most of this set up is that the office, the bank, and the data center were all within half a kilometer of each other. If something bad happened to that part of town they only had the infrequent offsite backup.
Every time I've looked, tapes were on top "again" for large scale archival. And I've been looking for ~20 years by now.
I don't get where people get the impression that X was at the top right before tapes got that last innovation (where X here is most often HDDs, but not always). But that's always the impression, and tapes are always on top.
People also have been working with 3D phase change drives since the 90s. Those always promise to replace tapes. But nobody ever got them robust enough to leave the labs.
Tape might win for large-scale systems, but it's basically dead for home-office scale.
You used to be able to get modestly priced tape units for home use from the old "connects to the floppy controller" units with capacities in the tens of megabytes, up to some late-gen SCSI/IDE/parallel/early USB models that would be a couple of gigabytes, but still at home-friendly prices. What's today's answer? An enterprise-grade device that might put 10TB+ on a tape, but comes with a four-figure price tag and isn't really sold at Best Buy.
If I want to back up the house today (maybe 4 active PCs, 5-6TB of total space), affordable choices are pretty much disk-based. I could choose a cheap NAS (I ended up doing that with an old fanless Atom machine and a used 12TB datacentre drive) or get a USB-attached external drive. Even if I used BD-XL media, my modest needs alone would be dozens of discs, plus getting a writer in a shrinking market. There are plenty of datahoarders with much bigger needs, but even for them, tape is completely outside the addressable market.
Amount of data makes it less realistic. We have around 2PB data currently and expect to grow around 1PB next year with maybe 10-30% annual growth rate.
$50K if needed, but it doesn’t look like it needs that. 2PB initial data; predicted 1PB/year, with around a 10-30%/year rate of growth of that rate (acceleration?).
I once pressed my boss into having off-premises storage of documents so we could still manufacture product in a new facility if the current facility burnt down. Unfortunately, someone started the habit of sending the primary documents to that same facility when the product was deprecated. One day, that off-premises facility burnt down and all the contents were lost. I think it was a regular self-storage space.
That aside, this sounds extremely old-fashioned, but it seems to me that the only media that is acceptable for long-term storage is going to be punched paper tape. How long does paper last? How long do the holes in it remain readable? Can it be spliced and repaired?
"Of the thousands and thousands of archived hard disk drives from the 1990s that clients ask the company to work on, around one-fifth are unreadable."
Why is this surprising?
It's been known for decades that magnetic media loses remanence at several percent a year. It's why old sound tape recordings sound noisy or why one's family videotapes of say a wedding are either very noisy or unreadable 20 or so years later.
Given that, and the fact that hard disks are already on the margin of noise when working properly, it's hardly surprising.
The designers of hard disks go to inordinate lengths to design efficient data separators. These circuits just manage to separate the hardly-recognizable data signal from the noise when the drive is new and working well so the margin for deterioration is very small.
The solution is simple, as the data is digital it should be regenerated every few years.
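In practice, "regenerating" can start with a periodic scrub that re-verifies previously recorded checksums and flags anything that needs to be restored from another copy and re-written to fresh media; a minimal sketch, assuming a checksums.b2 file produced earlier with b2sum:

    # re-verify everything against the recorded BLAKE2b checksums
    b2sum -c checksums.b2 2>&1 | grep -v ': OK$' > refresh.list || true
    # anything listed in refresh.list should be restored from another copy
    # and copied onto fresh media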
Frankly I'm amazed that such a lax situation can exist in a professional storage facility.
Edit: has this situation developed because the digital world doesn't know or has forgotten that storing data on magnetic media is an analog process and such signals are deteriorated by analog mechanisms?
I'm amazed too. The mere fact that they accept storing hard disks powered off for years is a big red flag. It's in everyone's list of top things to avoid doing, not just in archiving circles.
People should learn one thing: data are not tied to the physical media hosting them the way words are tied to paper, and the only way to preserve data is to migrate them from one physical medium to another regularly, sometimes also converting their formats, because things change and an old format could end up unreadable in the future.
The only reason we have any copies of "books" (i.e. long written works) from the ancient world is that they were painstakingly copied over centuries from one medium to another, by hand for most of that time.
Definitely right. Above I meant that we can't preserve bits simply by collecting them in libraries of physical media the way we do with shelves full of books or folders of archived sheets. With bits there is no "original" and "copy": any copy is still an "original", and we do not lose information or introduce changes by copying.
For long-term archiving, the fundamental hard problem is storage density. The further the storage unit size drifts below human scale, the harder it is to archive for the long term.
I think for the average person, the best thing to do for long-term archives is to take advantage of Sturgeon's law: "90 percent of everything is crap". Triage the things you want to archive down to a minimum, then print them out, at human scale, on paper. Have physical copies of the photos you want to keep, listings of the code you are proud of, correspondence that is dear to you.
This will last, with no intervention, a very long time. Because, as is increasingly becoming obvious, once the format drifts below human scale, the best way to preserve data is to manage the data separately from the medium it is stored on, with a constant effort to move it to a current medium, and it easily evaporates once that vigilance lapses.
I grew up in Pittsburgh. When I was flying in and out of the Pittsburgh airport (usually to Atlanta) during and after college, I would often see uniformed Iron Mountain employees waiting for standby seats, carrying their little Pelican cases…
A hard drive from 1995 will most likely be formatted FAT16 or FAT32, and the last Windows OS that reads that filesystem by default is Windows XP - I keep an old XP workstation operational as a test rig for reading data from old hard drives.
Or I have to mount them in another OS that isn't Windows. It's more than just adjusting DAW settings and updating plugins at this point; you need to know that around 2000 the filesystems completely changed with NTFS, which added security that wasn't present before.
By the time of Vista/7, FAT hard drive support is gone from Microsoft land. There are of course add-ons and such, but you still need to _know_ this happened, and FAT drives look unformatted in modern Windows.
An interesting topic would be the debate on how to store data for extremely long periods of time, several hundred thousand years, for example documentation of nuclear waste sites from power plants... Any ideas?
As counterintuitive as it may be, it seems to me like the only reliable long-term storage for data is with commercial cloud providers.
Any time you're physically warehousing old hard drives and whatnot, they're going to be turning into bricks.
Whereas with cloud providers, they're keeping highly redundant copies and every time a hard drive fails, data gets copied to another one. And you can achieve extreme redundancy and guard against engineering errors by archiving data simultaneously with two cloud providers.
Is there any situation where it makes sense to be physically hosting backups yourself, for long-term archival purposes? Purely from the perspective of preserving data, it seems worse in every way.
^ This. Physical media is continuously degrading. Large storage systems work by regularly reading, verifying, and replicating data - it is always doing backups and restores. If this isn't happening actively and regularly, your data will cease to exist at some point in time.
Whether we collectively need to store all these things is another question entirely. But if we want to keep it - we'll have to do the work to keep it maintained.
> Two cloud providers pretty much guarantees against that -- the idea that two would terminate it simultaneously is vanishingly small.
Ask Julian Assange about that. Sure, the US government claimed he had committed a crime, but he disagreed. You really need to store your data in at least two of (a) NATO and Western-aligned countries (b) Russia and aligned countries (c) Mainland China, and that’s assuming the prior probability of you being at risk in those blocs is low. It’s hard to avail yourself of this if you’re a legitimate company, but plausible if you’re a private citizen.
SLA numbers don't say anything about your particular piece of data, just all customer data on aggregate.
> The idea that two would terminate it simultaneously is vanishingly small.
Au contraire, if you become sanctioned or illegal, both would necessarily have to terminate simultaneously, if the cloud providers want to comply with local laws.
You're not using a Chinese or Iranian cloud, are you?
Cloud provider stats are based on aggregate numbers. If they claim 99.999% of all data is retained, and they have 100,000 TB of data collectively, then if they lose your entire 1 TB of data then they can still claim that they maintain 99.999% of data as long as they don't lose anyone else's data.
But in practice everybody's data is widely distributed.
So I'm not sure what kind of event you're imagining that would take down one customer's data and no one else's.
It's not an issue. Whereas if you're warehousing all your data in a single physical location, it's vastly more likely for it to be fully destroyed due to a fire/earthquake/flood/etc.