Jason Scott in general is one of my internet heroes. A lot of what he covers in his work is before my time, so it doesn't carry any nostalgia for me, but it appeals on a different level altogether.
Books I've read, and classes I took in school, covered the history of the net and computers, but only the things that they feel are 'important': you get the overview of the evolution of programming languages, the history of the computer, and a timeline of the internet and world wide web. In each case, it feels like it's all about linking one big advancement to the next big advancement.
What you get from Jason's documentaries and presentations and sites however, is this really raw and honest look at what real people were actually doing at the time. It's hard to explain why that feels so important to me, but it does.
If you're interested in the history of the Web and its inception, I highly recommend this (non-technical) read [1] by Tim Berners-Lee. Really inspiring.
One of my first websites was an emulation site in 1999/2000. SNES was by far the most popular category on the site, so I had the bright idea to play 2-3 minutes of every single game in order to take a screenshot of the gameplay. Three months later I was finally done. Then I lost everything in a hard drive crash before I could push the new version of my site live. Dammit.
Man, I've been looking for the name of a Sega Genesis game for a long time and couldn't remember it. I just saw it on your site. "Atomic Robo-Kid" was one of the best games I've ever played. Super hard and super frustrating, but it has a great feel!
Ah, the ZX Spectrum, that takes me back. I learned assembler and BASIC on it, and even had a Pascal and a C compiler (the last two had to load from tape). Make one mistake and you had to reload the whole thing again and wait 5 minutes or so.
It was amazing how there was this closeness to the machine: you booted right into the programming environment and had to type a command to load a game or do anything.
Some of the games I remember were exceptional; Elite was one of them. Just think about the ability to pack everything into 48K of memory.
I wonder if the copyright office collected information about the platform a game would run on when it was copyrighted? Could probably figure some of it out by the publisher. I might have to go look for that data to mine.
I've been thinking a lot the past few weeks about the preservation of art and goods in general. I think it comes from two main sources:
1. Maciej Ceglowski (idlewords) has been talking a lot recently about link rot. Recently, he found that 25% of items pinned a mere five years ago were dead links and 17% from only three years ago were dead [^1]. That's an incredibly high rate -- and, selfishly, one I'm noticing as my bookmark folders for recipes (RIP BroEats.com) and designs and other things are filled with more and more duds.
2. I've been reading Do Not Sell At Any Price [^2], a book about the subculture of 78rpms. These are records that are so rare and so -- for lack of a better term -- unwanted by the vast majority of the music-listening populace that the act of collecting them is less about hoarding and more about preservation. To quote one of the characters in the book (roughly from memory):
"It's a weird feeling, holding this thing in your hand and knowing that you could break the song," he said. "I snap this record in half and this song is lost forever. It's a lot of responsibility, and sometimes I think that's why I take it so seriously."
I can honestly say that, prior to reading this article, I had no idea what a ZX Spectrum was. Now, after some digging, I do -- and I still have no desire to play one, obtain one, or hold onto it in any meaningful way. (And seeing as I'm usually on the weirdly attached end of the spectrum with these kinds of things, I doubt I'm the only one.) But I'm struck by how important it is to hold onto these things, even if it's in a cardboard box in a forgotten closet somewhere or a link on the Internet Archive that gets clicked once every couple of decades.
I'm not positing that there will ever be a point in time that someone has the hankering to play ZX Spectrum Xtreme Chess, but I think there's inherent value in preserving this ecosystem -- something of a testament to the people who made it, the people who played it, the novelty that at one point in time there were five million living rooms with this machine in them.
The Web turned 25 this year, and it's already coming down with acute cases of memory loss. I'm hoping that by the time it hits fifty, the problem won't have gotten worse -- it will have gotten much, much better, not just with URLs but with remembering the time when people played 3D StarFighter by the Oliver Twins. [^4]
(This is a very roundabout way of saying the following: Jason, you are completely awesome for doing this, and thanks for sharing it with us.)
> The nice things about IA links is you can pretty reasonably assume that they won't suffer from link rot, right?
The catch there being, the Internet Archive retroactively respects robots.txt files that forbid crawling, so if someone gets control of a domain they can block the archived pages. This is a big problem with lapsed domains that get swept under the umbrella of a holding company that has lots of domains pointing to the same content, with a blanket robots.txt.
This is a huge problem. Sites like NASA's NTRS are retroactively blocked.[1] It's not clear which user agents one must allow in robots.txt. NTRS allows archive.org_bot, but apparently ia_archiver is also needed. At some point the allow directive in the NTRS robots.txt[2] no longer matched, nuking all historical data.
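A minimal sketch of why allowing one crawler name is not enough, using Python's standard `urllib.robotparser`. The robots.txt text below is hypothetical (it is not the actual NTRS file) and the URL is a placeholder. It only shows that a rule written for `archive.org_bot` does not cover a crawler identifying itself as `ia_archiver`; how the Internet Archive actually evaluates robots.txt is not something this demonstrates.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration only: one archive crawler
# name is allowed, everything else is blocked.
ROBOTS_TXT = """\
User-agent: archive.org_bot
Allow: /

User-agent: *
Disallow: /
"""


def check_agents(robots_txt, agents, url):
    """Return {agent: allowed?} for each user agent against one robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


# Placeholder URL; any path on the site falls under the same rules here.
result = check_agents(
    ROBOTS_TXT,
    ["archive.org_bot", "ia_archiver"],
    "http://example.org/reports/page.html",
)
for agent, allowed in result.items():
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

With this file, `archive.org_bot` matches its own entry while `ia_archiver` falls through to the catch-all `*` entry, so a site that means to permit archiving would have to list every crawler name it cares about.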
A few years ago I talked to an IA engineer, who said they were planning on dealing with this by not crawling sites whose nameservers were known to point to a domain parking company. The idea was that if they never retrieved the robots.txt, they wouldn't retroactively apply it. I don't know if that filtering out of parking nameservers ever happened, and it wouldn't help for parked domains whose robots.txt they'd already retrieved, but it would help with domains that lapse in the future.
But why retroactively remove the data? The original owner was fine with holding it, why should the snapshot be deleted because a completely different person wants his completely different website to not be crawled?
It's hard for a bot to understand concepts of 'owner' and 'completely different person' based on the data they have available. Companies can use this robots.txt feature to un-index old marketing content after a re-branding, for example. Or after an acquisition.
Sure, but, surely, the bot has timestamps saying "robots.txt allowed me to keep these documents last time I spidered them". Why do they have to be retroactively removed? robots.txt only disallows spidering, it doesn't mandate that you should delete all the data you've already spidered.
Because most of the problems come from people who want to hide old material that they didn't realize was being indexed. The automatic behavior is simple and easy to implement, and doesn't require any human intervention.
You are not mistaken. The Internet Archive does not delete or "nuke" the data that is blocked by a robots.txt. Even though cough some people believe so (see parent thread).
What you say is true for sites mirrored by IA's Wayback Machine, though my understanding is they retain the data in case the robots.txt is lifted later on.
Linking to media uploaded to the main archive itself should be safer, though.
Have you heard of other caches/archives (e.g. Google) applying the same retroactive policy? Presumably IA has no way of finding out that domain ownership has changed. I wonder if they are applying this policy to pages referenced by Wikipedia, http://blog.archive.org/2013/10/25/fixing-broken-links/
Good point, I would say both are needed. The WARC is only useful if there is a matching web browser and operating system in a VM which could render the HTML+Javascript and produce the original layout. PDF/A would mostly retain the browser layout.
I use virtual notary for this. You give it a URL, and vn fetches it, and gives you a cryptographic certificate of time of retrieval and website's content at that time.
>I've been thinking a lot the past few weeks about the preservation of art and goods in general.
I've been thinking about this myself. What I found out is that current copyright law is terrible for the preservation of our culture, because it makes attempts to archive, share, back up or digitize abandoned games, music or movies illegal before their copyright expires, at which time most of them would probably be lost forever.
Do you mean to tell me that those crates of my dad's 78 rpm records sitting in my basement might actually be worth something? I was thinking about playing them on a 33.3 rpm turntable into my PC, then using Audacity to speed them up to 78, just to be able to hear the music again, and by association, remember my dad. Geeze, there must be at least a hundred of 'em down there, at least!
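For the speed-up step, the arithmetic can be sketched as below. This assumes the turntable runs at exactly 33 1/3 rpm (the comment says 33.3) and that the discs were cut at exactly 78 rpm, which is not guaranteed for older records; the function name is made up for illustration. It covers only the speed ratio, not playback equalization or stylus size.

```python
from fractions import Fraction


def speed_multiplier(recorded_rpm, playback_rpm):
    """Factor by which a capture must be sped up to sound at recorded_rpm."""
    return Fraction(recorded_rpm) / Fraction(playback_rpm)


# Assumed speeds: nominal 78 rpm disc, played on a 33 1/3 rpm turntable.
multiplier = speed_multiplier(78, Fraction(100, 3))
percent_change = (multiplier - 1) * 100

# A side that lasts 3 minutes at 78 rpm takes this long to capture slowly.
capture_minutes = 3 * multiplier

print(f"speed multiplier: {float(multiplier):.2f}")
print(f"percent change:   {float(percent_change):.0f}%")
print(f"capture time for a 3-minute side: {float(capture_minutes):.2f} min")
```

So the capture would need to be sped up by a factor of about 2.34, and each side takes a bit more than twice as long to record as it does to listen to.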
Google the work of Philip Jeck (samples on YouTube) for hugely emotional use of the old sounds.
Try and get hold of an actual 78 rpm turntable, the old dansettes could do those speeds as could the cheaper separate turntables that had ceramic cartridges with two 'sides' (78/33).
I have a few one-off remaining records and I'm very conscious of the responsibility that goes with that. I've never played them and the only time they will be played is when there is a very high grade digitization done at the same time. I've been looking at ways to digitize them without having to actually mechanically play them (using lasers) but those systems are expensive! (http://www.elpj.com/product/price.php)
Well, sure, buying one is expensive, but you're just looking to use one. You only need one to share amongst whomever else you can find who has this need, and the Internet is really good at bringing those sorts of people together. You could Kickstart your way up to owning one of these, just as one idea. (Or work with somebody else to take it on, etc. etc., whatever.)
Kickstarter is an interesting option, never thought of that.
You could jumpstart a service that digitizes rare recordings and makes them available. Likely there'd be all sorts of copyright issues with a service like that, but the basic principle is useful.
> as my bookmark folders for recipes (RIP BroEats.com) and designs and other things are filled with more and more duds
If you'll forgive me the self-promoting diversion, this is one of the reasons I created my recipe manager app Zest (http://plentyofzest.com/zestapp/). It's frustrating to have a recipe move/disappear, so now I can collect them all in one place.
Excellent question and this is on our roadmap (yes, I know, vapourware until we ship it). The plan is to export as HTML with standard microdata annotations (the same we use to automatically import). That should be fairly durable, as far as formats go, and even if the HTML dies at least it is structured (for conversion to a new format) and text-based (at worst you can still read the source).