
A Thread about Internet Archive's “Silent Killer” - danso
https://twitter.com/textfiles/status/1204428311553642496
======
commoner
The Internet Archive is absolutely essential to Wikipedia, whose articles are
required to be verifiable to reliable sources. When pages go offline, the link
rot makes it harder for articles to be verified. However, if the Wayback
Machine has an archived copy, that copy can then be cited as the source and
made available to readers and editors. The Wayback Machine now automatically
archives every new external link added to the English Wikipedia.

[https://en.wikipedia.org/wiki/Wikipedia:Link_rot](https://en.wikipedia.org/wiki/Wikipedia:Link_rot)

The Wikimedia Foundation's budget is about 10 times that of the Internet
Archive. If you see the fundraising banner on Wikipedia and want to help out
the site, but don't think the Wikimedia Foundation needs the donation,
consider donating to the Internet Archive instead.

[https://archive.org/donate](https://archive.org/donate)

~~~
ignoramous
> The Wayback Machine now automatically archives every new external link added
> to the English Wikipedia.

Links submitted to Hacker News should be auto-archived, too: I've often stumbled
upon dead links [0] that had otherwise generated insightful discussion on
news.yc. Adding _archive_ next to _web_ would work nicely.

[0] Launch HNs, in particular.

~~~
scrollaway
It is possible to do that nowadays and that is one of the things I suggested
implementing in tildes.

[https://gitlab.com/tildes/tildes/issues/586](https://gitlab.com/tildes/tildes/issues/586)

I believe all link aggregators should do this; they're a great way of getting
de facto quality/interesting content discovered by the archive.

~~~
dang
> A mechanism to save pages to the web archive has now been implemented. Just
> need to ping this URL:
> http://web.archive.org/save/<insert URL you want to save>.

It's been on our list to do that for some time, but this should make it much
easier. Thanks!
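
A minimal sketch of what such a ping could look like, assuming the save endpoint quoted above accepts a plain GET with the target URL appended. The function names are made up for illustration, and rate limits or response details are not covered here.

```python
import urllib.request

# Endpoint prefix as quoted in the comment above.
SAVE_PREFIX = "http://web.archive.org/save/"


def build_save_url(target_url: str) -> str:
    """Append the page to archive to the save endpoint."""
    return SAVE_PREFIX + target_url


def request_snapshot(target_url: str, timeout: float = 30.0) -> int:
    """Ping the save endpoint and return the HTTP status code.

    Assumption: a plain GET is enough to trigger a capture.
    """
    with urllib.request.urlopen(build_save_url(target_url), timeout=timeout) as resp:
        return resp.status
```

A link aggregator could call `request_snapshot(submitted_url)` from a background job at submission time, so that a slow or failed capture never blocks the submission itself.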

~~~
Sukotto
I hope you folks implement this, and also that YC sets a recurring donation to
the archive. (Possibly tied to how many URLs you submit)

------
tetha
I've discovered their efforts to archive old dos games... fully playable in
the browser through a dosbox build in WASM as far as I know. That's a really
impressive cooperation of very old and very new technology - 16 bit up to the
edge of JS (- edit: or rather, the edge of browser based computing).

And on that train of thought, I just had a little flashback about storage
sizes. There was some time when 3.14 MB was a big unit of measurement, and
some 50MB drive was huge. The classical example: Monkey Island, Indiana Jones
or Day of the Tentacle on ~12 floppy disks. But you had to choose which 1 or 2
to install because you didn't have enough space on your hard drive, or you had
to swap disks every few screens :)

Or, they are archiving a lot of video playthroughs in the lets-play style from
video sites, in case those sites go through a meltdown like viddler does.

I guess I'm rambling. Point is: This isn't just a storage dump. There are also
interesting projects around the internet archive to make the old things
accessible on new systems. Very worthwhile donating to.

~~~
theandrewbailey
> There was some time when 3.14 MB was a big unit of measurement

I don't know what storage medium that refers to. Plain old 3.5" floppies only
went up to 1.44 MB.

~~~
SamBam
You know how old computers always rounded things to the nearest _pi_...

~~~
B1FF_PSUVM
Get out of here.

(Still sore Apple did not stick to the 1kB = 1000 bytes of the original MacOS
and gave in to the "powers of two" insanity ;-)
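
For what it's worth, the "1.44 MB" floppy figure mentioned above is itself a mix of both conventions. A small worked example, using the standard high-density 3.5" disk geometry (the variable names are just for illustration):

```python
# Geometry of a standard high-density 3.5" floppy.
SIDES, TRACKS, SECTORS_PER_TRACK, BYTES_PER_SECTOR = 2, 80, 18, 512

capacity_bytes = SIDES * TRACKS * SECTORS_PER_TRACK * BYTES_PER_SECTOR  # 1,474,560

# Pure decimal units: 1 MB = 1000 * 1000 bytes.
decimal_mb = capacity_bytes / (1000 * 1000)  # about 1.47

# Pure binary units: 1 MiB = 1024 * 1024 bytes.
binary_mib = capacity_bytes / (1024 * 1024)  # about 1.41

# The advertised figure mixes the two: 1 "MB" = 1000 * 1024 bytes.
marketing_mb = capacity_bytes / (1000 * 1024)  # 1.44

print(capacity_bytes, round(decimal_mb, 2), round(binary_mib, 2), round(marketing_mb, 2))
```

So the same disk is 1.47, 1.41 or 1.44 "megabytes" depending on which definition of the prefix is in play.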

~~~
codetrotter
I don't get why you are being downvoted. Are people not familiar with the
expression "get out of here" used in a friendly manner? Or do they really
downvote you because they think it is so bad that you think that 1kB should be
equal to 1000 bytes? I mean, I don't agree with you either but IMO the
downvote should only be used for things that are either factually wrong, or
which detract strongly from the conversation, or which contribute nothing at
all. I think your comment is nice and it does not deserve to be downvoted.

~~~
lioeters
Yeah, apparently downvotes can be used as a "disagree" signal. Usually when I
see a comment being unfairly downvoted that way, I upvote as a counter-
balance, even if I disagree with the argument (i.e., how many bytes in 1kB).

------
redisman
I encourage everyone to also think about what they enjoyed on the internet or
in computer software in the '80s, '90s and '00s and see if it is still
available somewhere.

I found out that the local games scene I used to love as a kid had been almost
wiped off the face of the earth. These were small non-commercial and
shareware games localized in just one language so it was already a niche. The
free hosting services of the 90s are gone so those sites are down and no one
wanted to keep paying for hosting for 20 years+ on sites that get very few
visitors nowadays.

The only way to get these games again was to find a Discord group and a
friendly stranger who agreed to seed a torrent (which had 0 seeds when I found
it). I'm looking to upload them to a couple of different places and compile a
basic website catalogue (static site on CDN) one of these days. For the
layperson, these games are already gone from the internet.

The internet archive does a great service but it is breadth-first and quite
surface level. The depth has to come from people who were familiar with the
sites at their peak. And there's a big chance that no one is doing that for
your specific interest.

~~~
foxthatruns
BlueMaxima's Flashpoint is a fabulous archival project that is saving as many
Adobe Flash games/animations as they can before browsers pull support at the
end of 2020. Really cool, since Flash games are a similarly concrete slice of
culture/history that will just be gone if they're not archived.

[https://bluemaxima.org/flashpoint/](https://bluemaxima.org/flashpoint/)

~~~
baroffoos
Looks like it's a huge amount of work to strip the DRM out of the games which
stop functioning when the original website stops working.

------
starsinspace
Although efforts like internet archive are noble (and I find it occasionally
useful), I'm not sure it's always so great that everything anyone does online
will be permanently archived.

I know many people feel that everything should be available forever. But for
me... it's pushing me away from doing much on the web. I liked it in the 90s
when things were more ephemeral. When you could make mistakes and not have
them easily found by anyone with a few clicks, forever.

~~~
VonGuard
You're right, let's burn the library down because one book has a libelous
chapter in it.

This argument is so horrible as to be actively harmful to Archive's work.
Jason Scott is a god, and if we didn't have him, we'd have to invent him.

WE DO NOT GET TO CHOOSE WHAT THE FUTURE FINDS INTERESTING.

We live in the only point in human history where we can actually save all of
humanity's knowledge and culture, and we can do so without having to worry
about physical space or staff to work the "library." It's a remarkable time we
live in, and yet, 99% of our society either doesn't care, thinks this work is
stupid, or actively works against it through horrific copyright laws.

We know more about how Rembrandt painted and lived than we do about how Atari
2600 programmers worked and lived. I can go to Rembrandt's house and see where
he lived, where he painted, how he worked, where he slept and ate and mixed
his paints and taught his classes.

Atari's old HQ is just another office building. The source code to those games
is mostly gone (thankfully, it's assembly and easier to disassemble). We need
to save our culture and digital heritage, else we forget where we come from.

Deleting some old tweets is one thing, but actively worrying about Archive's
work is just harmful to us all. We need 10,000 more Archives, dammit. It's
supremely important work that is helping stem the tide of lost culture due to
stock market forces. Geocities is gone forever because Yahoo! didn't find it
profitable. This cannot keep happening.

~~~
ballenf
I’m not convinced it’s dangerous to explore whether there are benefits to
ephemerality.

I’m also not sure your Rembrandt example shows what you suggest it does. The
average Atari 2600 programmer would be more equivalent to the hundreds of now
unknown artists in Rembrandt’s time. The John Carmacks of today will be
remembered in detail with or without blanket archive efforts.

Maybe, just maybe, Rembrandt’s status in our minds is a result of generations
of people each seeing the individual value in his work. That is, each
generation does indeed get to decide what future generations remember. Or at
least it used to be true until the digital age.

Maybe the change is an improvement. But maybe not.

And libraries are the epitome of what you’re fighting against. They are by
definition works chosen by humans based on judgment calls of their perceived
value.

Let’s at least acknowledge that blanket archive efforts are a fundamental
change in themselves and a departure from the human status quo for thousands
of years. Then let’s debate whether the change is an unabated good.

~~~
VonGuard
In 2008 I found a parcel of bare EPROMs at a flea market containing 27 games. 1
of those games was Cabbage Patch Kids Adventures in the Park, and it was
spread across 12 chips, each one showing a progressive state of development
across 9 months.

To my mind, this was the only known find of a vintage Atari 2600 game and its
iterative development process. So, 30 years later, the only reason we had this
snapshot is because someone found these chips and sold them at the flea.

The current state of digital preservation is abhorrent. Those roms would have
taken up less than 1/4 of a 5.25" floppy, but the company behind them never
thought to preserve that information or data.

Take2 Interactive republished BioShock in 2012. They couldn't find their
source code. They didn't save it. They had to go machine to machine looking
for it. The reissued game is not the same as the original.

As a society, we don't place any value on this stuff, but the potential value
of it cannot be understood until the future has occurred. Letting it vanish is
a disservice to the future. In the past, if a book was published, it wasn't
going to vanish if the publisher went out of business, there would simply be
no new copies.

In our digital online age, things vanish in seconds, days and hours. This is
also a very different state of affairs. In the past we could not save
everything, but everything didn't have a clock counting down from the end of
the quarter over its head, counting the seconds until it is deleted.

The Library of Congress tries to save everything. Yes, libraries weed the
stacks and choose items to host. This is due to space concerns: they can't
host everything ever. Digitally, they can, and many host reams of microfilm
and old newspapers because they can.

Libraries can, thanks to tech, now host every book ever, digitally, for very
low costs. Copyright prevents that.

This is an unabated good. Leaving things behind and forgetting them is how you
get Tulsa, Oklahoma, or the Armenian Genocide denials. We don't get to choose
what the future finds interesting, and for the first time in history, we do
not have to. Why in the ever-loving fuck would you worry about that?

Most likely, only for personal reasons. This is a humanity level problem. Your
personal worries are irrelevant in 100 years when everyone who ever knew you
is dead anyway. Geocities would be more interesting at that time, as a subject
of study.

~~~
NeedMoreTea
Library of Congress, British Library, Bibliothèque Nationale etc choose to
save everything they are mandated to, and a fair bit extra besides. That
includes everything published. They don't save their water cooler chats,
personal letters and everything sent by post, everything said on the phone or
Facebook, etc.

The bar - perhaps found accidentally - seems quite important in deciding what
_must_ be archived, and what probably shouldn't.

~~~
dmitriid
And yet, hundreds of years later historians and linguists crave for letters,
and post, and telegrams to get a glimpse of actual life outside official
publications.

~~~
NeedMoreTea
Sure, and a hundred or more years later the family of the author, or relatives
of the recipient can decide to release the family letters or telegram from WW1
or the US Civil War etc. That delay, usually at least until the correspondents
have died, is important. The affair, the less than ideal belief, and all that
other imperfect demonstration of humanity can no longer hurt or embarrass. It
ceases to be private and personal and moves into the historic.

Releasing whilst the probably famous sender is alive is most often in the
realms of to do damage, simply tasteless or paid for revelations in the gutter
press.

------
joe_the_user
So, copyright conditions are apparently another silent killer [1].

A website can be archived, vanish from the web, and then vanish from the
archive for technical copyright reasons (new owner's robots.txt file on the
root). So "archiving the archive" might be useful. Or something.

Ezboard was an old discussion site that contained much of interest - archived
and now the archive is not accessible.

[https://archive.org/post/389127/ezboard-content-suddenly-not...](https://archive.org/post/389127/ezboard-content-suddenly-not-available-in-the-new-system-why)

~~~
tgsovlerkhgsel
TIL they still follow robots.txt --
[https://blog.archive.org/2017/04/17/robots-txt-meant-for-sea...](https://blog.archive.org/2017/04/17/robots-txt-meant-for-search-engines-dont-work-well-for-web-archives/)
mentioned that they were planning to
stop doing that (and I remember reading a news article based on it claiming
that they already stopped following robots.txt, hence my confusion). Truly a
shame. I get following robots.txt at collection time, I don't get following a
robots.txt that was added later.

~~~
joe_the_user
Well, that was 2 and 3/4 years ago and ezboard is still gone. I hope they
didn't erase it but rather just excluded it, but legality and all that.

------
mastazi
> A shutdown announcement put these at risk. We worked with the founder, @tedr
> (RIP), who'd left the company, to save as much as we could. 3.5 Terabytes

I work for a company that owns a website containing information that in my
opinion is valuable to the public. The website may go offline forever in the
coming months. How can I get in touch with the Archive, to ensure that the
content is saved? Parts of the content are not easy to index (e.g. there are
"hidden" pages that you will only find if you have the exact URL), I can
assist with that.

~~~
rahuldottech
Other than irc, you can send a DM to
[https://twitter.com/textfiles](https://twitter.com/textfiles), or email at
jason@textfiles.com.

~~~
mastazi
Thanks!

------
jedberg
It seems they could save some money by moving a bunch of infrequently accessed
data to warm storage. The entire archive does not need to be accessible 24/7.

I would be perfectly ok if I was trying to see a copy of a web page from five
years ago, and it said that I had to make a request and it would be available
in five or ten minutes.

I think I could wait five or ten minutes for a web page to get pulled from the
archives.

~~~
anamexis
Practically speaking, how would warm storage like that be implemented?

~~~
dredmorbius
Usually tape libraries.

A problem with most forms of warm storage is that they involve moving media
around, which itself can lead to degradation.

I would like to see some write-once _and really last forever_ form of storage
to come into widespread use. That seems ... like it's still a few years off.

And you still have the issue of how such storage is read, and how that affects
total lifetime.

~~~
ccostes
Sounds like what they're doing at Microsoft with Project Silica
[https://news.microsoft.com/innovation-stories/ignite-project...](https://news.microsoft.com/innovation-stories/ignite-project-silica-superman/)

~~~
dredmorbius
That definitely is in the direction I'm considering.

------
bpaddock
The Food and Drug Administration (FDA) has been archiving the FDA.gov site at
archive-it.org .

Leaving a lot of dead links on the FDA site. Sometimes they tell you to look
in the archives for the old information, without giving you a link to it, and
sometimes they don't, they just expect you to know.

Now why can't the FDA afford the space to keep their pages forever on their
own site? Fill in your favorite conspiracy theory...

Some of the information that has been removed, such as the 2015 hearings on
fluoroquinolone antibiotics, is important health research, to give just one
example.

[https://archive-it.org/organizations/1137](https://archive-it.org/organizations/1137)

~~~
Analemma_
> Now why can't the FDA afford the space to keep their pages forever on their
> own site? Fill in your favorite conspiracy theory...

This seems like a prime example of Hanlon's razor. Tight government budgets
and lowest-bidder contractors not bothering with page permanence strike me as
the most likely explanation.

~~~
ghaff
It may also be worth observing that it's how pretty much any private company
website operates. Arguably the FDA should be different and archive old, even
outdated, content. But, while private companies may explicitly archive some
materials like press releases and earnings reports, 99.9% of their focus is on
the current content and they'll mostly just delete anything that's not in
service of today.

------
tyingq
Pretty amazing what they do with 1/10th the revenue of Wikimedia and quite a
lot more data to manage.

~~~
dredmorbius
IA are a capture process. Wikipedia supports active updates to content, under
contentious circumstances, as well as structuring content for useful access
(something IA also does, but under looser conditions).

Both are highly worthwhile projects -- some of the best of the Web. But a
straight apples-pomelos comparison is difficult.

~~~
tyingq
Wikipedia also gets a lot of that done with free labor. And, over the years,
their spend has increased quite a lot more than their content, update
frequency, and so on.

I think it's a fair enough comparison.

------
notacoward
The real "silent killer" I see here is the reliance on mirroring. One-failure
protection, with a 2x expansion factor. As it happens, I work on large storage
systems, where 2x is our maximum expansion factor and for that we get
resistance to as many as _nine_ simultaneous failures. Across power and
network failure domains, with multiple kinds of background scrubbing to detect
loss of that redundancy. Oh, and 60PB is something we might add to an existing
cluster for a day to absorb transient I/O load. There's also a bunch of
monitoring and automation stuff that should be considered "table stakes" for
storage at these scales. Seems like an opportunity to use what I and others
have learned for a good cause, to make this valuable resource more efficient
_and_ more durable all at once.

~~~
scarejunba
Is that different techniques or you just have vastly more resources? I'm sure
they'd accept your help in either case if you offered directly.

~~~
notacoward
It's different techniques (erasure coding). As it turns out, applying those
techniques would _reduce_ need for additional physical resources as well,
since it allows data to be stored more efficiently with the same physical
resources. It does require more CPU and memory relative to bytes stored or
spindles to store them on, but it's easy to make that tradeoff and still come
out ahead.
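
The tradeoff described above can be put into numbers. This is only a sketch: it assumes a maximum-distance-separable code (Reed-Solomon is the usual example), where data is split into `k` data shards plus `m` parity shards and any `m` lost shards can be rebuilt. The 9+9 layout is a hypothetical chosen to match the "2x, nine failures" figures in this thread, not a description of any real deployment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scheme:
    name: str
    data_shards: int    # k: shards holding the original data
    parity_shards: int  # m: extra shards used for recovery

    @property
    def expansion(self) -> float:
        """Raw bytes stored per byte of user data."""
        return (self.data_shards + self.parity_shards) / self.data_shards

    @property
    def tolerated_failures(self) -> int:
        """Simultaneous shard losses that are survivable (MDS assumption)."""
        return self.parity_shards


# Two-way mirroring is the degenerate case: one data copy plus one extra copy.
mirroring = Scheme("2-way mirroring", data_shards=1, parity_shards=1)
# Hypothetical layout with the same 2x overhead.
erasure = Scheme("9+9 erasure code", data_shards=9, parity_shards=9)

for s in (mirroring, erasure):
    print(f"{s.name}: {s.expansion:.1f}x storage, survives {s.tolerated_failures} failure(s)")
```

Both layouts store two raw bytes per user byte, but the mirrored one survives a single loss while the erasure-coded one survives nine; the price is the extra CPU and memory needed to encode and rebuild shards.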

------
maxton
I donated a week or so ago. The Internet Archive has come in handy many times
for me, not just for the Wayback Machine but also things like their live music
archives. They're an indispensable resource.

~~~
giancarlostoro
They also archive ROMs for different consoles.

~~~
_sbrk
Which is a copyright violation, right?

~~~
giancarlostoro
ROMs are legal at least here in the USA. Doesn't stop big companies from
bullying little sites and becoming copyright trolls. The Internet archive will
not be easily bullied to take down legally archived items thankfully.

The trick is you have to create the ROM yourself apparently. Sharing it is
bad. I would assume copies made by anybody would be identical, so it would be
interesting to see such a case in court against The Internet Archive, not that
I want them to get sued, but would love for them to win if they did.

Nintendo is more aggressive about ROMs than anybody.

~~~
chongli
_Nintendo is more aggressive about ROMs than anybody_

They are, but they’re also the most proactive in making their old games
available on new hardware. It wouldn’t be fair to call Super Mario Bros
“abandonware” because Nintendo has done a lot to keep the game alive and
accessible to new generations.

~~~
vadansky
They have been caught putting other people's "illegal" roms on VC

~~~
giancarlostoro
[https://www.eurogamer.net/articles/2017-01-18-did-nintendo-d...](https://www.eurogamer.net/articles/2017-01-18-did-nintendo-download-a-mario-rom-and-sell-it-back-to-us)

Here's an article that asks this very question, not sure if it was posted on
HN prior or not.

------
nym
Set up a monthly $5 donation. It's not much, but I know how valuable the
archive is... they're doing the work the library of congress SHOULD be doing.

~~~
criddell
> I know how valuable the archive is

How valuable is it?

~~~
Avamander
Data is generally hard to put a value on.

~~~
tingletech
When I took an animal communication class in college, there was a formula for
the value of information (not data).

I don't remember the exact formula, but it was similar to Shannon's equation.
Basically, the more valuable information is, the larger the change it effects
in the probability of what an organism's next behavioral state is. So, if
information signaling didn't change the behavior of another organism, it
wasn't considered communication.
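
The exact formula from that class isn't given here, so the following is only one plausible way to make the idea concrete: measure how far a signal shifts the receiver's probability distribution over next behaviors, using the Kullback-Leibler divergence from information theory. The behavior names and probabilities are invented for illustration.

```python
import math


def kl_divergence_bits(posterior: dict, prior: dict) -> float:
    """Bits of 'surprise' in moving from the prior to the posterior distribution.

    Assumes both dicts share the same keys and the prior is nonzero
    wherever the posterior is nonzero.
    """
    return sum(
        p * math.log2(p / prior[state])
        for state, p in posterior.items()
        if p > 0
    )


# Hypothetical receiver: what it is likely to do next, before any signal.
prior = {"flee": 0.1, "feed": 0.6, "rest": 0.3}

# After an alarm call the distribution shifts strongly toward fleeing.
after_alarm = {"flee": 0.8, "feed": 0.1, "rest": 0.1}

# A signal that changes nothing leaves the distribution untouched.
after_noise = dict(prior)

print(kl_divergence_bits(after_alarm, prior))  # large shift, about 1.98 bits
print(kl_divergence_bits(after_noise, prior))  # no shift, 0.0 bits
```

Under this measure a signal that leaves behavior unchanged scores zero, which lines up with the "not considered communication" criterion above.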

------
toomuchtodo
Internet Archive donations are matched 2-1 currently, for those interested.

[https://archive.org/donate/](https://archive.org/donate/)

~~~
Akababa
Who's matching the donations?

~~~
aeyes
I guess the Pineapple Fund

[https://pineapplefund.org/](https://pineapplefund.org/)

~~~
textfiles
It is not the pineapple fund, although it has been a lovely time working with
them. Regardless of the cryptocurrency discussion, the fact that someone who
feels some sort of windfall would immediately have an urge to share it with
organizations that need it is laudable.

------
duelingjello
IA and WBM are great and essential, like a Library of Congress/Smithsonian.
What's frustrating about some old websites like Microsoft or Borland's FTP
download area is that dynamic links weren't followed and can't be followed and
websites that used user-agent filtering. CDN links also weren't captured well.

There's so many retro patches that just don't exist publicly. For example, a
number of files on SciTech's IA's WBM have zero captures. Most FTP sites
weren't captured in WBM adequately either. There are spots of FTP archives
hosted here and there on IA and elsewhere, but they're not like WBM for static
content sites, and a single snapshot archive lacks the history and the
changes, before and after. It is what it is, unless folks donate their vintage
personal/work local mirrors to add to the collective.

~~~
nine_k
Donating your copy when no authoritative source exists is a great chance to
retroactively update the past.

Anything that was not securely scanned by WBM but was donated should have a
different and clearly denoted status, if admitted.

------
phendrenad2
The Internet Archive is great, but it risks becoming a single failure point if
we rely on it too much. Also it would be good to take some of the server load
off of them. One possible solution is for smaller archives to exist. So if
you're interested in archival, and you have some spare time and cash, consider
not only donating to IA, but also setting up your own archive site with
content on whatever category or topic that you found interesting enough to
archive.

~~~
weystrom
Exactly, decentralization is the way to go. Data hoarding is a hobby. Just
check out r/datahoarder.

------
rikroots
The British Library's web archive[1] does similar archiving work, but limits
itself to 'British' sites - in other words, sites with a .uk domain. I've had
good dealings with the admins before, when I submitted some of my personal
sites for inclusion in the archive.

Interestingly, the British Library uses web crawlers based on the Internet
Archive's Heritrix web crawler[2], which demonstrates how important IA's work
is for many other archival organisations' work.

[1] [https://www.webarchive.org.uk/](https://www.webarchive.org.uk/) [2]
[https://github.com/internetarchive/heritrix3](https://github.com/internetarchive/heritrix3)

------
echelon
A linear amount of data could be saved if we extricated text content from the
HTML skeleton that contains it.

I wish Semantic Web had taken off. "Pages with styling" was suboptimal. Web
apps are such a weird evolutionary branch we've descended into that don't
relate to documents.

Content instead should have fallen under a type of ontology: news item, blog
post, technical reference, comment, status update, ... If we'd adopted such a
markup grammar and styled around it, we could parse out meaning, have stronger
links in the graph, and compress.

Semantic Web would have happened if commercial web didn't outpace it.
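
As a rough illustration of the first point in this comment (pulling text content out of the HTML skeleton), here is a sketch using only the Python standard library. The class and function names are made up, and real pages would need far more care (encodings, whitespace, inline scripts, and so on).

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring markup and script/style bodies."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


page = (
    "<html><head><style>p{color:red}</style></head>"
    "<body><h1>Title</h1><p>Hello <b>world</b></p></body></html>"
)
print(extract_text(page))  # Title Hello world
```

The extracted text is much shorter than the page, but everything the markup expressed (emphasis, structure, styling) is thrown away in the process.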

~~~
account42
Once you start transforming a document you always risk discarding information
that later will turn out to be useful. Besides, I bet images, videos and other
non-text data dwarf the space required by HTML markup.

~~~
echelon
Yes, that's what I expect to be the case. I'm just musing over a world built
in semantic markup rather than presentational with arbitrary structure.

------
toyg
TIA should get money from the UN. Or be the only beneficiary of a flat tax on
network ports - I bet even just $5 on every small router sold in the US (which
is basically nothing) would generate a ton of money for them.

------
alwillis
Let’s not forget that we can all participate in archiving web data using
IPFS[1], which the Internet Archive is also using.

And coming soon, you’ll be able to get paid to make content available using
IPFS and FileCoin[2].

[1]: [https://ipfs.io/](https://ipfs.io/)

[2]: [https://filecoin.io/](https://filecoin.io/)

------
mirimir
It's too bad that there's not a vaguely-somehow-related-but-not-really and
impossible-to-censor service that retains stuff that sites have excluded using
robots.txt or whatever.

~~~
333c
Is this supposed to be a veiled reference to archive.is? I'm not sure what
point you're making.

~~~
mirimir
No, I wasn't thinking of that site _per se_.

I was thinking more about the 90s era dream of uncensurable "data havens".
That led to Freenet, for example. Which is slow, and forgets stuff that
doesn't get accessed. And Tor onion services, which are more readily taken
down.

But the problem is that there's no way to know in advance whether something is
about to disappear from the Internet Archive. So you'd need someone inside
who'd discreetly alert the backup service.

------
mellosouls
The Internet Archive is a fantastic and important initiative and we should
definitely support it.

 _But:_

Let's also support the public service players in its space that often get
forgotten or marginalised by its well-funded marketing.

I'm thinking particularly of the various national libraries that preserve
content under incredibly tight budgetary, PR and legal constraints that the IA
is relatively free of.

On IA-aware media like HN, there is a tendency to present it as the only
preservation initiative out there, which is absolutely not the case.

[https://en.m.wikipedia.org/wiki/List_of_Web_archiving_initia...](https://en.m.wikipedia.org/wiki/List_of_Web_archiving_initiatives)

------
sdhankar
Some things about the Internet Archive which aren't as obvious unless one has
been there in person:

- Housed in a church, it's not what a typical software development shop might
  look like; it's a mix of open office and workshop floor.
- A big workforce actually does a lot of manual work, scanning books and
  transferring from different media types.
- It's an interesting experience to tour the data center, which is in the
  office itself. I cannot remember the details, but there was something
  special about those machines to control the heating.

------
PostOnce
I don't understand why archive.org keeps all kinds of formats, .ogg and .mp3,
.pdf and .jpg for the same resource? Why not just whatever the original format
was? Sometimes it's a dozen or so formats it seems like.

There are probably a lot of reasons, but I don't know where to chat about it
to learn more.

edit: now that I think about it, if the original is pdf, then jpeg makes sense
for loading one-page-at-a-time in the browser, but it seems like mp3
transcoding to ogg is reasonable to leave up to the user?

------
corporate_shi11
I found out about the Wayback Machine back in 2007 or so. I was young and just
starting to explore the internet and found it incredibly fascinating to
explore news sites as they were on the day of major historical events (9/11,
etc). I've relied on the Internet Archive by way of the Wayback Machine
countless times. I'll be adding them to my modest donation list,
along with Wikipedia and similar public service sites.

------
azinman2
PSA - Many companies will match donations. Apple, my employer, does this
through a site called Benevity.

------
rebuilder
The Internet Archive is amazing in many ways. Kind of sadly, for me, the most
amazing thing is that they haven't been sued out of existence. How do they
manage to operate with copyright laws being what they are?

~~~
earenndil
They've been granted a DMCA exemption.

------
gravypod
Is there any way to download warcs collected by the internet archive? I'd like
to try and back some up and use them for some search algorithm benchmarks.

------
S_A_P
This is one of the last bastions of the internet I remember as a kid. People
publishing knowledge because they could. Sharing things that were interesting.
Giving to other people. Sure there has always been seedier elements or
commerce, but by and large this was a place to grow and share knowledge. By
most every measure today the internet is a better place than it was but I
still miss those days...

------
Bnshsysjab
I’m somewhat curious about their backup methods - mirrored drives aren’t great
even if they’re stored at two separate locations.

Surely this would get enough support that they could host torrents of the
content stored in chunks, and have many peers download and seed many chunks
making the backups entirely distributed? I’d gladly seed a few hundred gigs of
data to ensure they maintain good backup procedures.
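
A toy sketch of the chunk-and-verify part of that idea, under stated assumptions: fixed-size chunks, SHA-256 as the hash, and a manifest published by the archive. It is the same general principle as torrent piece hashes, not a description of how any real system does it; all names and sizes are invented.

```python
import hashlib


def split_into_chunks(data: bytes, chunk_size: int) -> list:
    """Cut the data into fixed-size pieces (the last one may be shorter)."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def build_manifest(chunks: list) -> list:
    """One SHA-256 digest per chunk, published alongside the data."""
    return [hashlib.sha256(chunk).hexdigest() for chunk in chunks]


def verify_chunk(index: int, chunk: bytes, manifest: list) -> bool:
    """Check a chunk received from an untrusted peer against the manifest."""
    return hashlib.sha256(chunk).hexdigest() == manifest[index]


archive = bytes(range(256)) * 40            # 10,240 bytes of stand-in data
chunks = split_into_chunks(archive, 4096)   # chunk size chosen arbitrarily
manifest = build_manifest(chunks)

print(len(chunks))                                   # 3
print(verify_chunk(1, chunks[1], manifest))          # True
print(verify_chunk(1, b"corrupted data", manifest))  # False
```

Because each chunk is verified independently, volunteers can each hold a small slice, and anyone restoring the data can detect a damaged or tampered chunk without trusting the peer that served it.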

~~~
jve
Not sure why you are being downvoted. That was actually the first thing I
thought - what is the backup strategy? Mirroring is for performance + service
continuity.

Let me remind you that there already was a fire that destroyed some equipment
for the Internet Archive, but luckily no digitized data:
[https://www.theverge.com/2013/11/7/5076166/the-internet-arch...](https://www.theverge.com/2013/11/7/5076166/the-internet-archive-seeks-donations-after-fire-destroys-equipment)

Luckily that article from 2013 mentions:

> None of the Internet Archive's digitized data was lost in the fire as backups
> are held in multiple locations.

------
wolfgke
Don't forget
[https://www.ft.com/content/5be1f2ee-d60b-11e9-a0bd-ab8ec6435...](https://www.ft.com/content/5be1f2ee-d60b-11e9-a0bd-ab8ec6435630)
(discussed under
[https://news.ycombinator.com/item?id=21007476](https://news.ycombinator.com/item?id=21007476)).

------
maxander
Perhaps they could get a grant from OpenAI or some similar AI research firm;
someday, when the technology arrives, the Internet Archive will be the
ultimate corpus.

------
saagarjha
One of the replies on Twitter exhorts them to "charge people" in all caps.
It'd be sad to see a library put behind a paywall :(

~~~
ghaff
While I tend to agree in this case, lots of libraries are behind a "paywall"
in some form or other. There are explicitly membership-only private libraries
but many, e.g. university libraries, also restrict access.

------
prirun
I wrote a Prime minicomputer emulator some years ago and put it online (telnet
em.prirun.com 8001). The Prime was a minicomputer from the 80's. I worked with
Primes for many years and also worked at the company as an OS specialist for
18 months.

Seven versions of the Primos operating system, from rev 18 to rev 24, have
been recovered from disks, 9-track tapes, and 8mm backup tapes. The company
died around 1992.

There has been a _huge_ amount of Prime software that, as far as I know, is
lost to time. Oracle ran on it. SPSS. A native DBMS. Every one of the major OS
revs had 5-10 minor revs. Some software products are only available for
certain revs. I actually used rev 12, so at least half of the OS versions are
completely missing. And Prime released source for their products. Most of that
is missing.

For me personally, the emulator was / is a very rewarding project, and maybe
for a handful of others who are still alive and used Primes in high school,
college, or work. It's been really fascinating to "relive" my Prime days now
and then, and others have made similar comments.

But how valuable is it really? Is it more than just a curiosity now? Sure, for
a current or future computer historian, it might be viewed as a gold mine, but
as time goes on, it has less and less value IMO, especially as the people who
actually used one die off. If in a few years, only 10 people in the world care
about such a thing, does it make sense to save it for all time? I have my
doubts. And if I did have _all_ of the versions of Primos, and all of the
software ever written for Primes, would it make a difference even now, when
there are people alive who actually used this computer system? Also seems
doubtful.

To me, an "archive everything and let someone else figure out if it's useful"
strategy seems impractical. If I had my choice, sure, I'd like to have every
major version of the OS and all of the products for each version. But all of
the dot-revs too? Nope - not important. Would I like _all_ of the manuals?
Yep!

Some of the products like Oracle, DBMS, etc. were a bitch to configure back in
the day, and certainly would be even harder now with very few people around to
help. So even if I did have them, I doubt I could get them running. And even
if they were running, they were a specialty product at the time, with few
experts, so finding someone today that would be interested in them would be a
needle / haystack thing.

It all reminds me of family pictures. Before my grandma died, she went through
a large box of B/W photos, explaining to us who all the people were, and we
wrote on the back of the photos. But just 3 generations after her, the
generation after me, no one knows the people in the photos (except my mom is
in some - their grandma). In a way it seems that this family history is
somehow important, but no one younger than me is interested _at all_. And
truthfully, I never look at the pictures myself. I ended up scanning the
photos with my mom or grandma, putting them on a digital picture frame for my
nephews, and leaving the rest out. I asked them first - they didn't care about
any of the others.

I'm not an archivist, but it seems to me that curation is a much more
important aspect than grabbing everything and keeping it alive forever. Not
curating seems like a "scale to infinity" problem. Those tend to not end well.

I think the IA is a great project and am not intending to be critical, just
giving my own perspective about preserving some very small piece of computer
history.

------
gatherhunterer
Twitter is the worst choice for publishing. This involved so much more text-
scrolling and photo-expanding than is necessary.

Edit: If anyone would care to explain why they feel that chopping an article
up into pieces of arbitrary size makes for a good user experience, please do
so. I know Twitter gets a pass from HN in general but it is simply not suited
to this type of content by design. This was a conscious choice by its
creators.

~~~
jon-wood
There really needs to be a new rule added to the posting guidelines about not
complaining about the format when Twitter threads make it to the front page.
No, it’s probably not the best possible format, but every single time the
comments are inundated.

~~~
decebalus1
There really needs to be a new rule added to the posting guidelines about not
posting Twitter threads

------
lazylizard
What does Marie Kondo think?

~~~
catalogia
What does lazylizard think?

------
peterwwillis
It seems like all those things they backed up didn't seem to impact anyone's
actual life. People lost files, websites, communications, whatever, and yet
they kept right on living life. So I can't help but think that this is all a
waste, like people who take pictures everywhere they go, and never look at
them later.

Is there any useful purpose to all this other than nostalgia? And at what
point does the nostalgia overwhelm the useful purpose?

~~~
jborichevskiy
Seems to me like it could come in very useful in a legal case, with very
significant and real impacts on someone's life.

~~~
Zitrax
Also useful for various research, in the same way as going to the archives in
a library can be useful.

Even if most of the data is uninteresting, that doesn't mean there isn't data
there that is very useful to someone.

------
HenryKissinger
Why not just let old data die? What's the point in devoting 10TB to saving
data which maybe 300 people will check in the next 25 years?

Information destruction is not necessarily a bad thing.

~~~
tingletech
Selection is destruction, and destruction is selection. (but not even a
black hole can destroy information?)

------
msla
Here's what I thought this would be about:

There's a problem with the Wayback Machine in specific which can kill your
ability to access it quite silently, unless you know how to use the browser's
development tools and interpret headers.

It has to do with cookies: Somehow, the Wayback Machine sets cookies... and
sets cookies... and keeps setting cookies, until it overflows its own ability
to _accept_ cookies. At that point, your browser tries to access a Wayback
Machine page, handing the server all of the cookies it currently has, and the
server refuses to deal. It absolutely denies everything, sending an error
header and a blank page. You have to clear all web.archive.org cookies to get
anything at all, at which point it works perfectly.

I've completely solved this problem by blacklisting web.archive.org in browser
cookie blacklists. I haven't had it happen since then. As far as I'm
concerned, the problem is diagnosed and just needs to be solved. At _their_
end.
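
To illustrate the failure mode described above, here is a minimal sketch of how a steadily growing cookie jar eventually produces a `Cookie` request header larger than a server will accept. The 8192-byte limit and the cookie names are assumptions chosen for illustration; the actual limit and cookies used by web.archive.org are not known from this thread:

```python
# Assumed limit on total request header size; real servers vary.
ASSUMED_HEADER_LIMIT = 8192


def cookie_header(jar: dict[str, str]) -> str:
    """Serialize a cookie jar the way a browser sends it in the Cookie header."""
    return "; ".join(f"{name}={value}" for name, value in jar.items())


def cookies_until_rejected(value_size: int, limit: int = ASSUMED_HEADER_LIMIT) -> int:
    """Count how many cookies of `value_size` bytes fit before the header exceeds `limit`.

    Returns the number of cookies in the jar at the moment the serialized
    header first becomes too large, i.e. when the server would start
    refusing every request until the cookies are cleared.
    """
    jar: dict[str, str] = {}
    count = 0
    while len(cookie_header(jar).encode("ascii")) <= limit:
        count += 1
        # Hypothetical cookie name; each response adds one more cookie.
        jar[f"c{count}"] = "x" * value_size
    return count
```

Because the browser resends every stored cookie on each request, nothing recovers on its own once the limit is crossed, which matches the "clear all cookies and it works again" behaviour.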

~~~
topherPedersen
I've run into this problem as well. I used to use a Chrome extension which I
authored that saved all of the pages I browsed to the wayback machine (in
another tab), and I would frequently need to clear my cookies to keep the
thing working.
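
For reference, a minimal sketch of what such a "save every page" helper boils down to, using the `http://web.archive.org/save/<URL>` endpoint quoted earlier in this thread. The function names are hypothetical, and the endpoint's rate limits and response format are not specified here, so treat this as an illustration rather than a robust client:

```python
import urllib.request

# Save endpoint as quoted earlier in the thread.
SAVE_PREFIX = "http://web.archive.org/save/"


def build_save_url(page_url: str) -> str:
    """Build the Wayback Machine save URL for an absolute http(s) page URL."""
    if not page_url.startswith(("http://", "https://")):
        raise ValueError("expected an absolute http(s) URL")
    return SAVE_PREFIX + page_url


def request_save(page_url: str, timeout: float = 30.0) -> int:
    """Ask the Wayback Machine to archive page_url and return the HTTP status.

    Performs a real network request; urllib's default opener does not
    store cookies between calls.
    """
    with urllib.request.urlopen(build_save_url(page_url), timeout=timeout) as response:
        return response.status
```

Since a script like this keeps no cookie jar, it may sidestep the cookie build-up that a browser extension runs into, though that is a guess rather than something verified here.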

