The Quest for a Universal Translator for Old, Obsolete Computer Files (atlasobscura.com)
95 points by Erlangolem 11 days ago | 31 comments

Reverse-engineering efforts help, like the work the LibreOffice team did for old "office" file formats, or the similar work by various free *CAD teams. There are even tools that help with reverse engineering file formats, like Kaitai [1].

[1] https://kaitai.io


Your link gave me a warning that the certificate was only for github domains. This one redirects to kaitai.io but doesn't warn.

That's because it redirects to the http version of the page.

Kaitai sounds really well done. Thank you for the link!

Was looking good until the penultimate paragraph where it says that this tool will not be available to the general public. Would have been nice to have had that up front so that I could have stopped reading right there.

If the physical media are indeed floppies and CDs, these guys will need to take into account a high share of corrupted data. I know, I know, this is not what the project is about, but without it, the outcome will be more theoretical than practical.

I remember trying to read a decade-old stash of floppies a few years ago. Too many were unusable. CDs are much better but also relatively unreliable. My USB drives fared much better; even the 64 MB one with the kangaroo logo still works.

The need to digitize and preserve media is a problem libraries are well aware of. In addition to a real need to migrate floppies and CDs to newer media, there's also an urgent need to move substantially older and one-of-a-kind media (e.g., 8mm film reels, LPs; think rare interviews with celebrities and luminaries from the 30s and 40s).

For digital media in particular, the Internet Archive is pretty zealously ripping older media [1].

[1] https://motherboard.vice.com/en_us/article/785zmx/jason-scot...

[edit] A few interesting examples of the sorts of problems that preservationists encounter in dealing with pre-digital formats can be seen here: https://www.nyu.edu/tisch/preservation/program/05fall/physic...

I haven't seen much discussion about how 20+ year-old CDs are doing. IIRC about half of my self-burned CDs have failed after about 10 years. I wonder what the situation is with factory-stamped -data- CDs. (I have yet to -hear- a problem with my 30+ y.o. music CDs, but I suspect that some aren't 100%.)

Slightly off-topic (but not much), and as a single data point: IMHO part of the issue with old "burned" CDs is not only the CD but also the "new" readers (and their speed).

I still have a very old 1x SCSI CD drive (the kind that not many people will remember, using a caddy-cartridge for the CD) on a computer dedicated to "recovering" old CDs, and the success rate is much higher than on modern CD/DVD readers. I would say 9 out of 10 with the SCSI drive vs. 6 out of 10 on a modern CD/DVD drive.

Very good point.

This is anecdotal, but in my retrocomputing adventures I use a lot of factory pressed CD-ROMs from ~1993-2000 and I have yet to see one fail to read purely due to age.

Burned CDs are a lot worse. How much worse depends a lot on quality and type of disc, in my experience. If you cheaped out on CD-Rs by buying the no-name 50-pack spindles back in the early '00s, those will likely all be unreadable now. The fancy gold Kodaks might have survived, though.

I also collect music CDs, and it's getting to be a problem that there's a lot of obscure music (demos, live recordings, etc.) that only exists on CD-R, and in very limited quantities. I do have rare demos by local bands etc. that have started skipping just because of the CD-R media aging.

There are some albums we have lost because of the disc rot problem. If well made, CDs last pretty well. But remember, some applications have had doubts about how well even paper lasts.

This is why DNA storage is exciting. Barring a civilization-reset catastrophe, we will need to read DNA, and to keep improving the process, for the long-term future. Also, in the right environment it has nice self-healing features.

But what happens if that catchy pop song turns out to be truly infectious when coded into DNA? It's the end of the world as we know it.

I'm always surprised when people talk about the unreliability of floppies. I still have my stash of Atari ST games from the early 90s and play them on a real machine regularly and I haven't had a single failure. Perhaps the lower density of 720k floppies makes them less prone to corruption, or perhaps I've just been lucky and stored them well.

1.44M floppies are easily the worst of the common formats when it comes to reliability. Even back when you bought them new in stores they were bad, and now they are very bad.

Double-density 3.5" media like the disks for your ST are much better, and both 360K and 1.2M 5.25" types have proven quite resilient in comparison. I play a lot with retro computers in my spare time and anecdotally I find that a random 5.25" 360K disk from 1985 is much more likely to be readable than a 1.44M one from 1995.

1.44MB floppies weren't particularly bad in design, but many people remember them from the tail end of the format, when reputable brands had left the market and only the crap was left over, and floppies from that era truly did suck.

Late 90s floppies and drives were definitely not as good as early 90s and 80s floppies.

Older USB drives use SLC flash, which is definitely more reliable than the MLC+ encountered today. (The downside is lower capacity; made slightly less bad by the fact that they were actual binary sizes --- I have a 16MB(!) drive that still works, and has exactly 16,777,216 bytes of user-accessible sectors.)

I have a PDP-11/23 in my lab right now. It contains the data and code my former advisor used to get his tenure-track job, as well as an interface to an electrophysiology recording rig. Takes up a good part of the room.

I have no idea where to put this thing, but we virtualized the entire thing years ago and spin up the VM when required.

If you're looking to get rid of it and are in the Midwest I might be interested... :P

East coast. I've already asked several computing museums, whose answer was "we have 5 in the basement already, but thanks!"

> Each new release tends to only support files created within the last two versions

Ugh, I hate software that does this. If you're creating an obscure proprietary format for your files, the least you can do is support it.

Laudable idea, but there's just so much variety of old platforms, file formats, and backup formats. It's hard to guess what might be highest priority. Maybe "Universal" just sounds a little strong to me.

Just covering every popular RISC/Unix platform would be daunting. Ever hear of Pyramid OSx? It was once popular. That's skipping over mainframes (not just IBM either, see https://en.m.wikipedia.org/wiki/BUNCH), mid-range (OS/400, MPE, VMS, Tandem), OS/2, BeOS, embedded platforms, and much, much more.

There have been some efforts in the past to take care of at least one of the problems that you enumerate. PRONOM [1], GDFR [2], and UDFR [3] sought (and still strive in the case of PRONOM) to be more formalized versions of the /etc/magic file, so that digital formats could be more readily classified automatically.

Unfortunately UDFR and GDFR have fizzled out (a theme that occurs somewhat often when projects have high ambitions and inadequate support). PRONOM is still around, but has difficulty adding accurate information, since the overlap between librarians and the technologists with the domain expertise such a database needs to succeed is quite small. It would benefit greatly from an infusion of engineers who wouldn't mind filling out forms with corrections. :)

[1] https://www.nationalarchives.gov.uk/PRONOM/Default.aspx

[2] http://library.harvard.edu/preservation/digital-preservation...

[3] http://www.udfr.org/
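To make the /etc/magic idea concrete, here is a minimal sketch of signature-based format identification in Python. The `SIGNATURES` table and the `identify` / `identify_file` names are my own illustration, not anything taken from PRONOM, GDFR, or UDFR; the handful of byte signatures listed are well-known ones, and real registries carry far more entries and richer matching rules.

```python
# Hypothetical, hand-picked signature table: (offset, magic bytes, label).
# Real registries (PRONOM, /etc/magic) are much larger and more nuanced.
SIGNATURES = [
    (0, b"%PDF-", "PDF document"),
    (0, b"\x89PNG\r\n\x1a\n", "PNG image"),
    (0, b"PK\x03\x04", "ZIP container (also used by OOXML, ODF, JAR)"),
    (0, b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "OLE2 compound file (legacy MS Office)"),
    (257, b"ustar", "tar archive"),
]


def identify(data: bytes) -> str:
    """Return the label of the first signature found at its expected offset."""
    for offset, magic, label in SIGNATURES:
        if data[offset:offset + len(magic)] == magic:
            return label
    return "unknown"


def identify_file(path: str) -> str:
    """Classify a file by looking only at its first 512 bytes."""
    with open(path, "rb") as f:
        return identify(f.read(512))
```

Note how coarse this is: a ZIP signature cannot tell a .docx from a .jar, which is exactly the kind of detail a registry needs domain experts to fill in.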

This is a good introduction to some of the issues facing academic libraries these days. For a more in-depth look at what strategies librarians are using to provide access to older operating systems / formats / user experiences, I'd recommend taking a gander at "Emulation & Virtualization as Preservation Strategies" by David S. H. Rosenthal:


I've worked on this problem in the past, for modernization and migration away from legacy languages to modern ones. You need to write a parser for the source language, but that is a one-time upfront cost. The parser should populate an intermediate, language-agnostic model, and then you can write as many generators as you require to translate that agnostic model into your target language of choice.

We had one-click solutions for COBOL to C, Java to C#, VB to Angular, etc. Once you have the parsers and generators the work for each project after that is minimal.
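As a rough illustration of the parser / agnostic model / generator split described above, here is a toy sketch in Python. Everything in it is hypothetical: the two-statement COBOL-flavoured source language, the `Assign` and `Add` node types, and the `generate_c` / `generate_python` functions are invented for the example and say nothing about how the real tooling was built.

```python
from dataclasses import dataclass


# Language-agnostic model: nodes carry meaning, not source or target syntax.
@dataclass
class Assign:
    target: str
    value: str


@dataclass
class Add:
    target: str
    amount: str


def parse(source: str) -> list:
    """Parse a tiny COBOL-flavoured toy language into the neutral model."""
    model = []
    for line in source.splitlines():
        words = line.strip().rstrip(".").split()
        if not words:
            continue  # skip blank lines
        if len(words) == 4 and words[0] == "MOVE" and words[2] == "TO":
            model.append(Assign(target=words[3].lower(), value=words[1]))
        elif len(words) == 4 and words[0] == "ADD" and words[2] == "TO":
            model.append(Add(target=words[3].lower(), amount=words[1]))
        else:
            raise ValueError(f"unsupported statement: {line!r}")
    return model


def generate_c(model: list) -> str:
    """Emit C statements (variable declarations are left out of this sketch)."""
    out = []
    for node in model:
        if isinstance(node, Assign):
            out.append(f"{node.target} = {node.value};")
        elif isinstance(node, Add):
            out.append(f"{node.target} += {node.amount};")
    return "\n".join(out)


def generate_python(model: list) -> str:
    """Emit Python statements from the same model."""
    out = []
    for node in model:
        if isinstance(node, Assign):
            out.append(f"{node.target} = {node.value}")
        elif isinstance(node, Add):
            out.append(f"{node.target} += {node.amount}")
    return "\n".join(out)
```

The parser is written once; supporting another target only means adding another `generate_*` function over the same model.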

File formats are fascinating: https://github.com/corkami/pics/blob/master/binary/README.md

For game file formats you may want Xentax (http://forum.xentax.com/) or the "fork" Zenhax (https://zenhax.com/)

Open some files in a hex editor and have a look. Run "strings". Investigate and learn.
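If you don't have "strings" handy, a rough equivalent is only a few lines of Python. This is a sketch that approximates the tool's basic behaviour (runs of printable ASCII of at least four characters); the function name and `min_len` parameter are my own choices, and the real utility has more options.

```python
import re


def strings(data: bytes, min_len: int = 4):
    """Yield runs of printable ASCII (0x20-0x7e) at least min_len bytes long."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    for match in pattern.finditer(data):
        yield match.group().decode("ascii")


def strings_in_file(path: str, min_len: int = 4) -> list:
    """Read a whole file into memory and list its printable strings."""
    with open(path, "rb") as f:
        return list(strings(f.read(), min_len))
```

Format names, version tags, and embedded file paths often show up this way and give you a first foothold on an unknown file.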

When I was younger I used to play BBS games. Things like LORD, ArrowBridge, Usurper et al. It would be cool to get those running again as a website.

Well, it's already been done for LORD: http://lotgd.net/

I can see other ways you could do that generically for every door game:

1. Run the game in local mode inside a DOSBox running under WASM in a webpage.

2. Run a classic BBS software in DOSBox using nullmodem and a virtual FOSSIL driver, and have a telnet server listen and redirect connections to BBS nodes (you'd need multiple nodes). There already seem to be classic telnet BBSs doing exactly that.

3. Run a more modern BBS software like Synchronet on a modern OS and shell out to DOSBox with a nullmodem connection (and possibly a FOSSIL driver) every time someone creates a connection.

Options 2 and 3 are much more computationally expensive, but they will give you proper multiplayer if you share folders correctly between the nodes.

I was absolutely obsessed with LORD, the greatest of door games. Looking back on it, even more than early Atari, BBS games are what turned me into a committed gamer for life. They’re the major reason why I got so into sims and RPGs, on computers and P&P; D&D and Shadowrun in particular.

I think it’s a piece of history that really deserves preserving.
