Open Library in particular has a very active repo with lots of volunteers, a weekly community call, and a rather accessible codebase. https://github.com/internetarchive/openlibrary
If anyone knows webpack well, we would LOVE to have this dev-facing issue resolved so that CSS auto-reloads: https://github.com/internetarchive/openlibrary/issues/4955
Will send more when I have more and when I've learned to be more generous. It's good to know that you're near Internet Archive.
But, oh, what a wonderful feeling
Just to know that you are near
Sets my heart a-reeling
From my toes up to my ears
-Bob Dylan, The Man in Me
Internet Archive: https://projects.propublica.org/nonprofits/organizations/943...
Wikimedia Foundation: https://projects.propublica.org/nonprofits/organizations/200...
Mozilla Foundation: https://projects.propublica.org/nonprofits/organizations/200...
Electronic Frontier Foundation: https://projects.propublica.org/nonprofits/organizations/430...
For example, all content from the old ezboard site has been removed based on the configuration of the current URL owner's robots.txt, and the current URL owner is just a domain parker. Ezboard hosted a lot of content back in the day.
The question I have is how fast the content is removed after the domain name registration changes, i.e., is there a window of time between the appearance of a new robots.txt and the next scheduled crawl, and if so, would it be possible to "rescue" the content, as ArchiveTeam would do, during that window, before it disappears?
If this is possible, there could be a service for monitoring changes to domain name registrations for sites that have large amounts of historical content. I would happily volunteer to set up such a service.
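A minimal sketch of the check such a monitoring service might run, in Python. It assumes the exclusion decision is keyed on the `ia_archiver` user agent, which this thread does not confirm. The function names `blocks_archiver` and `became_restrictive` are hypothetical, and fetching the old and new robots.txt bodies is left out.

```python
from urllib.robotparser import RobotFileParser

# Assumption: the archive's crawler is matched under the name "ia_archiver".
ARCHIVER_AGENT = "ia_archiver"


def blocks_archiver(robots_txt: str, url: str, agent: str = ARCHIVER_AGENT) -> bool:
    """Return True if this robots.txt body disallows `agent` from fetching `url`."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(agent, url)


def became_restrictive(old_robots: str, new_robots: str, url: str) -> bool:
    """Return True if `url` was allowed under the old rules but is blocked under the new ones."""
    return (not blocks_archiver(old_robots, url)) and blocks_archiver(new_robots, url)
```

A monitor would call `became_restrictive` with the last robots.txt it saw and the one it just fetched, and raise an alert when the result is True. Whether there is actually time to act on that alert is the open question above.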
Hidden. Even when you request for them to remove stuff.
Had domain, stuff got archived, asked for them to remove it, added robots.txt. Domain lapsed. Someone else picked it up. Their robots.txt is now permissive, so the old stuff that I asked them to remove is visible again.
Even if the owner is the same, allowing the site to be archived going forward isn't the same thing as permitting it retroactively.
What if you lost your domain, but owned it in the past? Can you delete stuff from that era?
I have to keep an old domain indefinitely to host a robots.txt just to keep sensitive personal data hidden that little me foolishly published on the open internet.
But I'm not complaining. The internet archive is a great gift. Using it with a bookmarklet really feels like a super power.
From the FAQ, they do not respect robots.txt since they only archive on request by a user and they do not remove archives unless they contain illegal content.
There is also the issue of EDNS subnet.^3 archive.is tries to require it; it wants to know what location a request is coming from. In addition to EDNS, archive.is inserts the IP address and geolocation of the incoming request into the HTML of the returned page as a tracking pixel.^4
Thus archive.is does some things archive.org does not do besides just ignoring robots.txt
One of the things archive.org does that archive.is does not do is that archive.org inserts an HTTP response header intended to disable Chrome FLoC.^5 I add this header for all sites in a local proxy; however I do not see many sites adding it as a courtesy. Thanks archive.org for doing that.
5. permissions-policy: interest-cohort=()
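For sites that want to send this header themselves, here is a rough sketch as Python WSGI middleware. Note that this is server-side code, not the local proxy mentioned above, and the name `add_floc_opt_out` is hypothetical. It only adds the header when the app has not set a Permissions-Policy of its own, so it never overwrites an existing policy.

```python
def add_floc_opt_out(app):
    """Wrap a WSGI app so responses carry 'Permissions-Policy: interest-cohort=()'."""
    header = ("Permissions-Policy", "interest-cohort=()")

    def wrapped(environ, start_response):
        def patched_start(status, headers, exc_info=None):
            # Leave the response alone if the app already sets a Permissions-Policy.
            has_policy = any(name.lower() == "permissions-policy" for name, _ in headers)
            if not has_policy:
                headers = list(headers) + [header]
            return start_response(status, headers, exc_info)

        return app(environ, patched_start)

    return wrapped
```

Usage would be something like `application = add_floc_opt_out(application)` in whatever module exposes the WSGI entry point.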
(I checked and ezboard is still excluded.)
People who didn't spend their Saturday mornings glued to the TV screen as children in the 1970s might not remember how American kids learned about history back then:
Peabody's Improbable History - Surrender of Cornwallis
Peabody and Sherman travel back to October 19, 1781 to witness Cornwallis surrendering to Washington. However, when they got there, he didn't show up.
Everyone else ought to have the right to be forgotten, including some drunk tweet they wrote 10 years ago and regret, or an old personal page which contained too much PII.
The Archive no longer has a way to opt out, which is bad enough, but I still think it should be opt-in.
For example - the graffiti at Pompeii is interesting (and is pretty much at the same "quality bar" as Twitter):
In a right-to-be-forgotten world, the way it would end up going is:
1. problematic potentates punish pitiable proles
2. someone invokes right to be forgotten
3. this is considered "good enough"
4. the problem conditions that allowed #1 to fester remain uncorrected
I feel this way about a lot of stuff these days (especially where the erosion of the tenets of a liberal society is involved), where people argue vociferously for a "solution" that can at best be considered an indirect way of handling the problem. You see this with a lot of contemporary calls for the dismantling of the tradition of free speech/free inquiry/freedom of association, for example. People end up chafing in the direction of proposals that have dual-use effects in the first instance and perniciously "null" effects in the second instance.
The cost seems low enough to just keep it.
Of course, to your point, they do keep it around. They don't just throw it in the trash.
* * *
It would be pretty neat if someone could figure out how to OCR all the cuneiform tablets and turn them into something searchable.
Well, it isn't named the Internet Encyclopedia, for a reason.
> Without trying to be contrarian, I don't think that everything should be archived
It isn't contrarian. The deletionists are seemingly the majority. It is contrarian to in fact archive all. the. things.
Speaking of billions: According to Kahle, Alexa Internet's compute infrastructure informed Amazon's take on IaaS (AWS).
Another perhaps lost nugget is that Amazon once funded (either in part or in full) the development of the Wayback Machine, Internet Archive's most impactful product. In addition, to date (if I'm not mistaken) Amazon continues to donate data it fetches from Alexa Toolbar installations to the Wayback Machine.
In a legal context, simply attesting to the validity of a screenshot is really common. So when that functionality is used Perma.cc is operating more as a permanent file storage service than a trusted archive.
Regardless, this does go a long way to solving the problem of dynamic sites.
FWIW: the Wayback Machine is just one part of the Internet Archive. The quoted bit accurately describes things you can do with an archive.org account, too. Readers here may be familiar with the archive.org-affiliated effort by a team specifically working to recreate the playability of old PC (and otherwise) video games with JSMESS.
> this does go a long way to solving the problem of dynamic sites
Maybe, but the "dynamic" aspect that I'm sure the other person had in mind doesn't have much to do with the D in DHTML so much as it has to do with the dynamism that arises when you have a smart server responding to requests from a fat, JS-powered frontend. It would be possible to accurately model this in and execute it from a series of static assets, in some cases, but it's rarely done.
Even many sites built with static site generators today are not going to be usable in the future. There's too much tight coupling to the environment/deployment configuration and not enough semantic richness to properly hint to the crawler what resources are necessary to archive. In the heyday of XML, it used to be a big deal to strive for machine-readable documents. Today's resume-driven-development-obsessed webdevs effectively cast a vote of no confidence even in HTML, doing an end-run around it daily, and figuratively holding up a middle finger to the Principle of Least Power.
To some extent, even a bunch of the projects associated with TBL's Solid initiative are guilty of doing the same.
For every Sci-Hub trying to create the library of Alexandria, there's an Elsevier trying to burn it down.
Current copyright law is largely on the side of the arsonists rather than the archivists.
(note: recipes are not copyrighted, though cookbooks are)
As I've recently come to understand, the Internet Archive itself used to have its own mailing lists for handling discussion, which interestingly enough seem to no longer be accessible—perhaps even lost.
As far as my own old posts are concerned, it looks complete :) But it isn't easily findable or searchable; the intended way of interaction is apparently to download an entire hierarchy and grep.
Is anyone working on an advertising model that achieves the basic goal of advertising, but without the centralising aspect which seems to be the root cause of many of the issues? Giant monopolies are always going to subvert regulation but the same industry as disconnected units might be easier to police. Obviously you can just try to split them up or limit their size with regulation after the fact but a good technical basis might help out.
Does web advertising just not make sense unless you can amass lots of private user data and track people across the web? If so can we subcontract that data to smaller companies we can trust with our data and effectively punish if they break the rules?
It does make sense, but it has to compete with the invasive-style advertising, and it will always lose. If you want the "good advertising" you have to kill the "bad advertising".
Fun fact: Archive.org is blocked in Bangladesh for god knows what reasons.
I donate to them with the hopes that they won't try to do anything that carries that kind of risk to their continued existence again.
Aren't they breaching copyright on a massive massive scale?
Any idea what LexisNexis is, or was back then? Thanks!
Like your credit report, you can get a free copy by writing them and requesting a copy. IIRC when I did it a few years ago, I had to make the request in writing, I wasn't able to order it online, at least not for free.
Other sites like outline.com (which I guess is a for-profit entity) don't really allow you to get around paywalls the way the Wayback Machine does.
As someone interested in building a site that gets around paywalls for semi-educational purposes I'm curious if anyone has details!
In my head IA was just the Wayback Machine with cached pages. But during the pandemic I realized that there is a plethora of actual books that one can check out with far less friction than in a standard library.
It is such a cool concept. There are also IA satellite projects (not sure if they are owned by IA or in partnership with it) for different non-English languages, e.g. arquivo.pt, so you can have the same plethora of content in other languages as well.
Basically, most people are fine with what the Wayback Machine does and they'll take down any mirror that the domain owner asks them to.
For democratic countries that have democratically elected to restrict the distribution of some materials the society considers harmful, such as Germany with Nazi propaganda, the archive happily decides to undermine those laws, which are clear, longstanding, and (disproving the single argument for free speech absolutism) have not slippery-sloped anywhere over decades. Why? Because laws you disagree with are, apparently, illegitimate.
It probably helps their bold defense of all that is holy to intimately know that these really are democratic countries, which aren’t going to just send a wet team to dismember them, Saudis-style, or to spend millions on an elaborate plan whose only purpose is to let you live for another month, with a clear mind that has complete certainty that you will die, and who did it.
The anti-semitism and racism on archive.org, plus some copyright violations, aren’t a byproduct of their “freedom”. It’s all of it. There are plenty of free hosts for video or documents, and an hour at minimum wage would pay for hosting quite a lot for quite a while for most of the archive’s audience. But the killer feature is immunity, through anonymity and DMCA’s Safe Harbour.
Sure, everyone here is only defending “free speech” and would never agree with the swastika-fetishists. Only, somehow, they never complain about ISIS having a hard time on Twitter, or porn being censored on Facebook. It’s the scans of Der Stürmer, especially of the 38 to 44 vintage, that are the chosen symbols of “democratization”.
It's still not 100% clear but elements of it include:
- Historical racist texts (Mein Kampf, 1930s newspapers) which I feel on balance is a good thing to preserve in a library. I would assume having those papers available does more to combat fascism than to encourage it (though who knows really)
- archives of websites that are dodgy (ISIS and Neo-Nazi are mentioned but there must be all sorts of crazy and/or bad stuff on the web that gets archived)
- following US law rather than local law (a general conundrum for internet sites like this, especially if you do it for countries whose laws you like but not for other countries)
- providing a place for people to upload and share files anonymously
So yeah, I'm not really a Free Speech absolutist myself (it doesn't seem like many who claim they are actually believe it when it comes to things they disagree with), but it doesn't feel like they're in the same boat as social media platforms who actively spread bad things if it increases ad impressions. There are some similar issues around policing large amounts of content and dealing with different legal, political and moral frameworks at scale, though.
I’m betting close to zero.
Unfortunately, most people 200 years from now won’t care about the 70 petabytes the Internet Archive has saved.
Don’t misunderstand: I am glad they do this and love their work. I just think we overestimate the long-term value of this info beyond a very small set of future historians or social historians.
Most people have their lives to live in this moment, and if they have a chance to look backwards before they were born, it’s not a big piece of their time.
It's a pretty good century for literature and other books.
Also, lots of people are interested in history. Those can be best sellers.
(That's only from a Russian, pretty sure actual French will add dozens more to this list)
Reading a reference 50 or even 200 years old is not absurd. A post detailing some research findings which is referenced in Wikipedia is still greatly valuable. YouTube historians often reference ancient patents to uncover the history of old items.
But what got me is YouTube historians and their "ancient patents..." I want to see the ones for reinventing the wheel.
So much great literature comes from centuries before our own. And considering that the internet is likely to be around forever or as long as humans persist, a snapshot of its initial decades will one day be one of the greatest "archaeological" treasures.
Perhaps your main point is that you do not care for their work nor for the work of literature written before your time. You need not apply your yardstick to anything else in a bid to gauge its value.
An archive is not the same as a local public library. The latter holds a small collection of mostly frequently accessed items (e.g. the published works of Dickens, Austen, etc.). The former holds a much larger collection of rarely accessed items (e.g. every letter written by/to Dickens that survived, every political pamphlet published in Philadelphia in the nineteenth century , etc.).
If your point is that most items in the archive will be rarely accessed, I don't think anyone will disagree with you, but suggesting that the literature of the nineteenth century is no longer of any interest was perhaps not the best way of making that point.
I found an archive of videos of my city from 1970, a street-view-like recording of select roads. I pored over it for a couple of hours, noting the buildings that were still there, the completely empty hills now filled with houses, etc. That kind of stuff is really cool.
Quite a few actually, and I'm not an outlier. Plus there are many adaptations and derivative works that exist.
You are betting very wrongly. People use large amounts of older literature. Maybe not in your territory - well, be aware then that many cultures do.
>Unfortunately, most people
That the median individual should be considered a parameter is very controversial. (Contextually: services can very easily be for interested minorities.)
>overestimate the long-term value
As if Project Gutenberg had not arguably been one of the most important endeavours in history.
>if they have a chance to look backwards
It is a fundamental part of education...
- Ralph Waldo Emerson
- Henry David Thoreau
- Rudyard Kipling (The Jungle Book)
- Anna Sewell (Black Beauty)
- Walt Whitman (Leaves of Grass)
- Edgar Allan Poe
- Alexandre Dumas (The Count of Monte Cristo, The Three Musketeers)
- Tocqueville (Democracy in America)
The value here is genuinely historical. In a hundred years, how will we track the etymology of common terms that originated in this age?
Memes make preserving the internet extremely important. Terms and ideas evolve so quickly that the history of language and thought will become obscure almost instantly. Even now it can be almost impossible to understand some internet terms if you weren't part of the subculture that spawned them at the exact time they were spawned.
Do you know what hunter2 means? Do you know it because of bash.org?
A person doesn't have to read all this material. The material has to be stored because our future society will have descended from this material, and if they don't have it they won't know how they got there.
You know, except for the bit where civilisation collapses over the next hundred years as the planet warms and hot countries invade cooler, developed countries looking for living space.