Addressing Recent Claims of “Manipulated” Blog Posts in the Wayback Machine (archive.org)
281 points by edward on Apr 25, 2018 | 214 comments

They aren't very prominent, but the two links at the top of the blog post provide useful context:



The summary is that last year Reid (a journalist for a major US television news network) publicly apologized for a series of blog posts that were characterized as 'homophobic'. More posts of a similar nature were recently discovered on archive.org, and instead of apologizing for these as well, Reid has disavowed them. She and her lawyers claim that unlike the previous occasion, these newly discovered posts were altered by 'hackers' either before or after being archived. The linked blog post is making the limited claim that the posts on archive.org accurately represent the posts present on Reid's site at the time they were archived, and do not appear to have been altered post-archiving.

> The linked blog post is making the limited claim the posts on archive.org accurately represent the posts present on Reid's site at the time they were archived, and do not appear to have been altered post-archiving.

This might actually be a good use case for a blockchain. Hashing the data that's added to the archive and then putting the hash in the blockchain would reasonably prove the data in the archive hasn't been modified at a later date.

I do agree that a tamper-resistant store would be useful for things like journalism, legislature, official government communication, campaign content for politicians, etc. A distributed ledger for these would also be good because then you’re verifying that store in public view.

It’s too bad all you can build with that is a meager, profitable SaaS business, not a wild speculative crypto-billionaire rocket ride.

The National Research Council of Canada is doing just this for public auditing of their allotted grants and funding:


Using a company called Catena:


What about “right to be forgotten” laws?

The hash chain doesn't contain the data, only a hash of the data. So the original article can still be altered, and the hash chain would only prove that it had been changed. I believe nothing in these "right to be forgotten" laws forbids noting that an article has been edited to remove names.

An interesting alternative would be to hash "chunks" of the original article so that a future verification could be applied to particular parts of the content. Let's imagine you hashed every 32 bytes, you could then determine which chunks changed at what times, without revealing the plain text content.
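A rough sketch of that idea, assuming SHA-256 and the 32-byte chunk size mentioned above. The function names and sample data are made up for illustration, not taken from any existing archive tool.

```python
import hashlib

CHUNK_SIZE = 32  # bytes per chunk, as suggested in the comment above


def chunk_hashes(data: bytes, size: int = CHUNK_SIZE) -> list:
    """Return the SHA-256 hex digest of each fixed-size chunk of data."""
    return [
        hashlib.sha256(data[i:i + size]).hexdigest()
        for i in range(0, len(data), size)
    ]


def changed_chunks(old: list, new: list) -> list:
    """Indices whose digests differ, including chunks added or removed at the end."""
    longest = max(len(old), len(new))
    return [
        i for i in range(longest)
        if i >= len(old) or i >= len(new) or old[i] != new[i]
    ]


# Illustrative data: three 32-byte chunks, with the middle one altered.
original = b"A" * 32 + b"B" * 32 + b"C" * 32
edited = b"A" * 32 + b"X" * 32 + b"C" * 32

changed = changed_chunks(chunk_hashes(original), chunk_hashes(edited))
print(changed)  # [1]
```

One caveat with fixed-size chunks: inserting or deleting a single byte shifts every later chunk, so everything after the edit shows up as changed. Also, very short or predictable chunks can be guessed by hashing candidate text, so the digests alone may not keep the content secret.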

The question of how to identify large complex works, of potentially variable forms (markup or editing format, HTML, PDF, ePub, .mobi, etc.) such that changes and correspondences can be noted, is a question I've been kicking around.

Chunk-based hashing could work. Corresponding those to document structure (paragraphs, sentences, chapters, ...) might make more sense.

Yeah that's an interesting question. How to parse the content into meaningful pieces and then hash in such a way that the content is not known, but the hash can be mapped to where it was in the document at an earlier time.

Keep in mind that at the scale of a large work, some level of looseness may be sufficient. Identity and integrity are distinct; the former is largely based on metadata (author, title, publication date, assigned identifiers such as ISBN, OCLC, DOI, etc.). Integrity is largely presumed unless challenged.

I'm familiar with RDA, FRBR, and WEMI, somewhat.


You could also check that only names have been changed by hashing the article after replacing the names (if known).
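A minimal sketch of that check, assuming the verifier knows which names were removed and that the edited article uses a fixed placeholder. The placeholder string, the helper name, and the sample sentences are all hypothetical.

```python
import hashlib
import re


def redacted_digest(text: str, names: list) -> str:
    """Hash the text after replacing each known name with a fixed placeholder."""
    for name in names:
        text = re.sub(re.escape(name), "[REDACTED]", text)
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# Hypothetical article text and name, for illustration only.
published = "Jane Example was seen at the meeting. Jane Example declined to comment."
edited = "[REDACTED] was seen at the meeting. [REDACTED] declined to comment."
tampered = "[REDACTED] was not seen at the meeting. [REDACTED] declined to comment."

names = ["Jane Example"]
reference = redacted_digest(published, names)

print(redacted_digest(edited, names) == reference)    # True: only the name was removed
print(redacted_digest(tampered, names) == reference)  # False: other wording changed too
```

This only works if someone still holding the original (or a previously recorded redacted digest) can supply the reference value.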

As it pertains to private citizens, I would not recommend something like this to archive or verify their personal data. But for government records, campaign records, etc, I would think that those laws do not apply to that information.

They will be as effective as anti-piracy law would be if pirates were paid to seed. At best they will prevent respectable publications from directly using distributed archives as a source.

> Hashing the data that's added to the archive and then putting the hash in the blockchain would reasonably prove the data in the archive hasn't been modified at a later date.

Here have yourself a timestamp https://petertodd.org/2016/opentimestamps-announcement


    pip3 install opentimestamps-client

    ots stamp myfile.txt

In fact, I probably have timestamps for the raw WARC files in the wayback machine archive which would prove that any hacking must have happened prior to May 2017: https://petertodd.org/2017/carbon-dating-the-internet-archiv...

Unfortunately, the files themselves aren't public, and each file contains dumps from hundreds of websites, so even if they were public they're not the easiest thing to verify.

Still, being the guy behind OpenTimestamps I should point out that in this case I don't think timestamp proofs really add that much: Reid's claims seem dubious even without cryptographic proof.

The first link[1] in that post (for "searchable database") is a 404.

1. https://opentimestamps.org/internet-archive/

Thanks! I'll take a look at that.

> This might actually be a good use case for a blockchain.

Or they could just implement trusted timestamping (RFC 3161). Using a blockchain is a heavy-weight solution and is rarely the right one.


RFC3161 has very poor security, as it blindly trusts certificate authorities.

You really need better auditing than that, which is why the certificate authority infrastructure now relies on a blockchain - Certificate Transparency - for auditing. Similarly, for timestamping specifically, Guardtime has used a blockchain for auditing their timestamps since well before blockchains got called blockchains.

So here's something I can't get a straight answer on:

Surely if content is served over HTTPS with a valid certificate, it should be possible to save (possibly as part of a WARC) a "signature" of the TCP stream that would go beyond proving that a web archive was created at a certain time, but also that it was served using that person's private key and thus from that person's web server. To claim otherwise, the subject would have to claim that a fraudulent certificate was generated for their domain or that their web server was broken into.

Unfortunately that's not possible.

Basically, the way the crypto math works in HTTPS is it's a symmetrical proof that only proves that either the sender or the receiver sent the TCP stream. Normally that's OK, because you trust yourself. But in this case the problem you're trying to solve is to prove what happened to a third party who doesn't trust the receiver, so your idea doesn't work.
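To illustrate why a symmetric proof convinces nobody else, here is a deliberately simplified model that uses a single shared HMAC key. Real TLS derives several keys from the negotiated secret, but both endpoints know all of them, so the same reasoning applies. The key and messages are made up.

```python
import hashlib
import hmac

# Stand-in for key material both endpoints hold after the handshake (hypothetical value).
shared_key = b"key known to both client and server"


def tag(key: bytes, message: bytes) -> bytes:
    """Authenticate a message with a symmetric MAC (HMAC-SHA256)."""
    return hmac.new(key, message, hashlib.sha256).digest()


served = b"<html>what the server actually sent</html>"
forged = b"<html>what the client wishes it had sent</html>"

server_tag = tag(shared_key, served)  # computed by the server
client_tag = tag(shared_key, forged)  # computed by the client with the same key

# Both tags verify under the shared key, so a third party cannot tell
# which endpoint authored which message.
print(hmac.compare_digest(tag(shared_key, served), server_tag))  # True
print(hmac.compare_digest(tag(shared_key, forged), client_tag))  # True
```

A recorded stream plus its keys therefore shows only that one of the two endpoints produced it, which is fine for the receiver but useless as evidence for anyone who doesn't trust the receiver.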

Damn you, Diffie-Hellman!

It's the same with the RSA key exchange. It's inherent in the fact that the TLS negotiation exists to make both sides agree on a common master secret (and some public cryptographic parameters like which cipher to use), from which all the keys used to encrypt and authenticate either direction of the stream are derived. Once the master secret is known, all keys are known and the rest of the connection can be decrypted and/or forged at will. (The "triple handshake" attack exploits this, by making two connections share the same master secret.)

The certificate is used to sign (parts of) the values used to create the master secret. It doesn't sign anything after that.

To determine that a given file came from a particular person, it would have to have a signature from that person's private key.

1. The web server doesn't sign content; HTTPS keys just provide encryption.

2. The HTTPS connection operates under the web server's certificate; it'll serve any file without regard to who created it.

Blockchain not required. People have been hashing things and publishing them in the NY Times for a long, long time.

This depends on how you define “blockchain”. If your model is bitcoin-style with attempts at anonymous consensus it's definitely a negative contribution.

If you're not trying to get rich quick, however, something like a Merkle tree is a great fit and it seems like there'd be value in a distributed system where trusted peers can vouch for either having seen the same content (even if they cannot distribute it due to copyright) or confirm that they saw you present a given object as having a certain hash at a specific time. Whether that's called a blockchain is a philosophical question but I think it'd be a good step up over self-publishing hashes since it'd avoid the need for people to know in advance what they'd like to archive.

To make that concrete, imagine if the web archiving space had some sort of distributed signature system like that. The first time the integrity of the Internet Archive is called into question, anyone on the internet who cared could check and see a provenance record something like this:

IA: URL x had SHA-512 y at time z

Library of Congress: URL x also had SHA-512 y at time [roughly z]

British Library: We didn't capture URL x around time z but we cross-signed the IA and LC manifests shortly after they crawled them and saw SHA-512 y

J. Random Volunteer Archivist: I also saw IA present that hash at that time

That'd give a high degree of confidence in many cases since these are automated systems and it'd lower the window where someone could modify data without getting caught, similar to how someone might be able to rewrite Git history without being noticed but only until someone else fetches the same repo.
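As a sketch of what checking such a provenance record might look like, assume each witness publishes plain (witness, URL, digest, time) records. The record layout, the one-day window, and the placeholder values `x` and `y` are assumptions for illustration. A real system would also need each record to be signed by its witness, which this sketch leaves out.

```python
from typing import NamedTuple


class Attestation(NamedTuple):
    witness: str
    url: str
    sha512: str
    seen_at: int  # Unix time of the observation


def agreeing_witnesses(records, url, digest, around, window=86400):
    """Witnesses that reported `digest` for `url` within `window` seconds of `around`."""
    return sorted(
        r.witness for r in records
        if r.url == url and r.sha512 == digest and abs(r.seen_at - around) <= window
    )


# Placeholder values only; "y" stands in for a real SHA-512 digest.
records = [
    Attestation("IA", "x", "y", 1_000_000),
    Attestation("Library of Congress", "x", "y", 1_000_600),
    Attestation("Volunteer", "x", "y", 1_003_000),
    Attestation("Someone else", "x", "not-y", 1_000_100),
]

print(agreeing_witnesses(records, "x", "y", around=1_000_000))
# ['IA', 'Library of Congress', 'Volunteer']
```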

(Disclaimer: I work at LC but not on web archiving and this comment is just my personal opinion on my own time)

That won't work because any hash on its own would be trivial to regenerate after modifying the data. You need something that can't be changed retrospectively in order to trust it.

That's what makes the blockchain useful - to change anything you'd need to regenerate all the hashes after the point you want to modify. That's a lot more difficult. Having a proof that's generated by network of parties (like a cryptocurrency) would add to the trust level, but it's not essential.
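A small sketch of that chaining property, assuming SHA-256 and an all-zero starting value (both arbitrary choices here). Changing one entry changes its hash and every hash after it, so the edit is detectable by anyone who kept a copy of the chain, or even just its latest hash.

```python
import hashlib


def link(prev_hash: str, payload: bytes) -> str:
    """Hash of the previous link's hash plus this entry's content."""
    return hashlib.sha256(prev_hash.encode("ascii") + payload).hexdigest()


def build_chain(payloads):
    """Return the list of link hashes for a sequence of payloads."""
    hashes, prev = [], "0" * 64  # all-zero starting value (an arbitrary choice)
    for payload in payloads:
        prev = link(prev, payload)
        hashes.append(prev)
    return hashes


def first_mismatch(payloads, published_hashes):
    """Index of the first entry that no longer matches the published chain, or None."""
    for i, (got, want) in enumerate(zip(build_chain(payloads), published_hashes)):
        if got != want:
            return i
    return None


entries = [b"snapshot 1", b"snapshot 2", b"snapshot 3"]
published = build_chain(entries)

tampered = [b"snapshot 1", b"snapshot 2 (edited)", b"snapshot 3"]
print(first_mismatch(entries, published))   # None: archive untouched
print(first_mismatch(tampered, published))  # 1: second entry altered after the fact
```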

EDIT: If the archive published hashes of everything they added daily in the NYT (or any publication) it would become unprintably large. It would only work digitally, at which point we're back to something that's trivial to modify...

What do you mean by "regenerate"? Making a new, unrelated hash? That doesn't do anything to a printed newspaper.

(If you could get the same hash then even a block chain won't give you integrity.)

If your list of hashes is huge, you can just print the hash of the entire list. There's no such thing as unprintably large for mere attestation.
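One way to get a single printable value for a huge list is a Merkle root, which is a refinement of simply hashing the whole list: it also lets you later prove that one item was in the list without republishing the rest. The sketch below is a simplified construction (SHA-256, an odd node paired with itself, one-byte prefixes to separate leaves from inner nodes). It is not the exact scheme used by Bitcoin or Certificate Transparency.

```python
import hashlib


def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_levels(leaves):
    """All levels of a binary Merkle tree: leaf hashes first, root level last."""
    if not leaves:
        raise ValueError("need at least one leaf")
    level = [h(b"\x00" + leaf) for leaf in leaves]  # 0x00 prefix marks a leaf
    levels = [level]
    while len(level) > 1:
        if len(level) % 2:
            level = level + [level[-1]]  # pair an odd node with itself
        level = [h(b"\x01" + level[i] + level[i + 1])  # 0x01 marks an inner node
                 for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def inclusion_proof(levels, index):
    """Sibling hashes needed to recompute the root from leaf `index`."""
    proof = []
    for level in levels[:-1]:
        sibling = index ^ 1
        if sibling >= len(level):
            sibling = index  # this node was paired with itself
        proof.append((level[sibling], index % 2 == 0))  # (hash, sibling is on the right)
        index //= 2
    return proof


def verify(leaf, proof, root):
    """Recompute the root from a leaf and its proof, and compare."""
    node = h(b"\x00" + leaf)
    for sibling, sibling_on_right in proof:
        node = h(b"\x01" + (node + sibling if sibling_on_right else sibling + node))
    return node == root


# Illustrative items; in practice these would be the day's archive digests.
items = [b"item-0", b"item-1", b"item-2"]
levels = merkle_levels(items)
root = levels[-1][0]

print(root.hex())  # the one value you would print in the newspaper
print(verify(items[2], inclusion_proof(levels, 2), root))  # True
```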

Is it still possible to place classifieds in the NY Times? I don't think there's still any way for someone to call up and have some random hash published, right?

According to their website they still have classifieds: https://advertising.nytimes.com/

I suspect the hardest part of doing that would be simply that you don't fit into their pre-existing categories.

FWIW, if you plan to do that, I'd suggest you put a Bitcoin block hash in the NY Times instead, which would prove the timestamps of everything that's been timestamped via Bitcoin. You can then timestamp your own stuff for free via OpenTimestamps, at which point your proof goes <your data> -> OpenTimestamps -> Bitcoin -> NY Times.

Timestamps are additive security, so it makes sense to publish them wisely. But if you're going to do that, might as well strengthen the security of as much stuff as possible in one go.

>I suspect the hardest part of doing that would be simply that you don't fit into their pre-existing categories.

You can put any classified ad in any category you want; the newspapers don't care.

I proposed to my wife by placing an ad in the real estate ads of the Vancouver Sun because I knew she'd see it there.

Ha, that's a lovely story, and good to know. :)

Forgive my shallow understanding of block chain, but wouldn't that make the archive immutable? Surely there are times where the Wayback Machine needs to delete snapshots, in cases where there's copyright infringement or other illegal activity.

Yes, it would make the archive immutable, but that doesn't prevent the data from being deleted.

A very similar example is found in git repos: while normally you'd have every single bit of data that led up to git HEAD, you can use git in "shallow" mode, which only has a subset of that data. If you delete all but the shallow checkouts, the missing data will be gone forever. The missing data is still protected from being modified by the hashing that Git does - and you're guaranteed to know that data is in fact missing - but that cryptography doesn't magically make the data actually accessible.

> Forgive my shallow understanding of block chain, but wouldn't that make the archive immutable?

Kind of. The current state of the archive is mutable, but changes to that state are logged to an append-only edit history. It's that edit history that is the "blockchain", and starting from a known good state and replaying all those edits must produce the current state. In fact, this is how cryptocurrencies work too: the state is the balances/UTXO set, and the blockchain records transactions, which are effectively just mutations on that state.

In this situation, you'd look at the current state and find the deleted snapshot missing, but the edit log would have an entry saying the snapshot was added (and what its hash was at the time), then another entry saying it was deleted.
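A toy version of that state-plus-log split, with made-up snapshot ids and an invented (action, id, content hash) entry format. In the scheme described, the log itself would also be hash-chained so that entries can't be quietly rewritten. That part is omitted here.

```python
import hashlib


def digest(content: bytes) -> str:
    return hashlib.sha256(content).hexdigest()


def replay(log):
    """Rebuild the current state (snapshot id -> content hash) from the edit log."""
    state = {}
    for action, snapshot_id, content_hash in log:
        if action == "add":
            state[snapshot_id] = content_hash
        elif action == "delete":
            state.pop(snapshot_id, None)
        else:
            raise ValueError(f"unknown action: {action}")
    return state


def history(log, snapshot_id):
    """Every logged action for one snapshot, in order."""
    return [(action, h) for action, sid, h in log if sid == snapshot_id]


# Hypothetical log: two snapshots added, then the first one deleted.
log = [
    ("add", "snap-1", digest(b"first capture")),
    ("add", "snap-2", digest(b"second capture")),
    ("delete", "snap-1", digest(b"first capture")),
]

state = replay(log)
print("snap-1" in state)  # False: the data is gone from the current state
print([action for action, _ in history(log, "snap-1")])  # ['add', 'delete']
```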

This is also an issue for major blockchains in deployment now, specifically Bitcoin. There is the potential for illegal content, or links to it, to be stacked on BTC’s blockchain [0], and so anyone who holds that blockchain would also possess it.

I believe this would also be an issue for things like Filecoin/IPFS but I’m not sure if the liability issues are different or nuanced.

[0] https://www.theregister.co.uk/2018/03/19/ability_to_dump_ill...

IPFS works like torrents: users only host things that they choose to, so there's no issue of some people being stuck hosting content they don't want to.

If you put the data itself in the blockchain then that would be true. I'm suggesting putting a hash of the data in a blockchain; you could delete the data and keep the hash in the chain. You couldn't regenerate the hash to check it, which might be a problem, but if the data has been deleted you'd have to accept the hash regardless. It'd only affect that link in the chain. (This is from my limited understanding of blockchain math. I definitely could be wrong.)

Soo... a Merkle tree?

Paper archives usually contain a ton of copyrighted material, e.g. "John Doe's papers" includes magazines, newspapers, letters written by other people, etc that are not copyright by John Doe.

Wouldn't an immutable archive where access was limited in such cases be a better alternative (even if it would require a law change)?

So I have an anonymous twitter account that tweets out various randomly located headlines, a couple times per day. Simply embedded in one of those tweets, each day, is a hash of the previous hash plus the current contents of some long-running data that I've been keeping and updating.

It's not as robust as a blockchain (maybe!) but it's easy and I've been doing it a good bit longer than 'blockchain' has been talked about. More importantly, I can use it to prove that I possessed certain files at certain times, historically.

> I have an anonymous twitter account

> I can use it to prove that I possessed certain files at certain times, historically.

Your twitter account would no longer be anonymous at that point. What's the utility of it being anonymous now?

I consider the value of it being anonymous right now to be unknown or undefined. In the same way, I consider the value of it being non-anonymous right now to also be unknown or undefined. Since disclosure can only flow in one direction, I'm not aware of any reason to irrevocably transfer from one state to the next.

So, a Merkle tree?

Roughly that, yup.

A blockchain solution is unnecessary for this kind of issue. The question is whether Reid or a hacker authored the posts. You just need signing to prove that. If all of Reid's posts were cryptographically signed, then a post by a hacker would be mysteriously missing a signature and the debate would be trivially resolved.

"Those dastardly hackers must have accessed my private key."

Or IPFS, or just signing content online and publishing a way to verify it, which can exist without a decentralized system.

But you can't force a publisher to use something like that, especially if it's the publisher that wants to deny its authenticity.

Apparently there are also archived copies of some of the posts on the Library of Congress site that were saved back in 2006 when they were originally posted: http://ws-dl.blogspot.co.uk/2018/04/2018-04-24-why-we-need-m...

I'm not sure how old the Internet Archive copies were since they're no longer available, but at least one of the ones saved elsewhere was originally archived in 2012.

Thanks for the links. Seems like the Streisand Effect is going to bring this to a wider audience than would have originally seen it. And all for the want of an apology.

slightly off topic, but can right to forget laws force sites like this to remove entries you designate?

For what it's worth, I have habitually been saving things with https://addons.mozilla.org/en-us/firefox/addon/save-page-we/... since last year. Though I have some other motivations, having a git repo with auto snapshots every minute and continuously snapshotting HN discussions and changing news sites has been giving me a "I'm gonna analyze the shit out of this" vorfreude for the past 2 months (don't know the English word, German for literally "pre-happiness").

I have been practicing the same habit. I save local copies of anything and everything I find interesting. I can't tell you how many obscure youtube videos seemingly disappear forever less than a month after I bookmark them.

Many tech savvy people I know have been similarly burned by link rot and now curate their own archives.

I have a bookmark file with thousands of links that has its origins in the late 90s. It's sad just how much of it is dead links.

It's sad enough when domains lapse and die, but what is really annoying is every site seems to change structures - bookmarking is becoming so unreliable these days.

A site gets a new CMS--whether it's a news site or some other organization's--and there's a good chance that they're going to break a bunch of links.

I actually used archive.org recently to get money back from a company. I had purchased my spouse a heart-monitoring watch band for her Apple Watch. (The Kardia one that can create an EKG.) At the time I purchased it, there was no wording on their website claiming that you needed a paid account to use the functionality. It was all written as being optional. So after 30 days went by and her device would no longer work for its main purpose, I wrote to the company. Turns out they've updated their web site, so I went to archive.org, made screenshots of what it looked like on the day of my purchase and told them, "you can either give me a full refund or I can do a chargeback." We got an immediate refund and sent the watchband back.

as a fellow german:

giddiness maps wonderfully in that sentence.

but to stay rational: 'looking forward to', 'excited for' are the more objective terms.

And "anticipating" could be a more general term that doesn't necessarily have the happiness connotation. "Eagerly anticipating" would give a hint of a certain desire for it to happen, but not necessarily happiness. Of course, anticipatory happiness can be read in via context if it's an obviously positive event.



vorfreude == “joyful anticipation“ says the wiktionary.

I think it would just be anticipation.

Just that word wouldn't be unambiguous, e.g. (again via Wiktionary):

"Often the anticipation of a shot is worse than the pain of the stick."




For me anticipation would be used when it was something positive; for a negative emotion I would use dread.

"Often the dread of a shot is worse than the pain of the stick."

> for me

I think it's just you, see Merriam-Webster, there are 1a, 1b, 2, 3a, 3b and 4 meanings, and yours is just "1b" as far as I see.


Also see


Joyful anticipation is not a phrase I have ever heard or expect to hear. It is too forced, and whilst it may be an exact translation from the German word in question it doesn't really answer the question. In the sentence used by the GP anticipation would be the most appropriate to use.

I wrote a cli application for this purpose. https://github.com/marvelm/erised

How does it handle assets like CSS, JS and Images?

You may want to put some of that stuff on IPFS, it's a good medium for snapshots. Just a thought.

The way I read it... the Wayback Machine did not allow archives to be taken down arbitrarily -- but a subsequent targeted robots.txt exclusion of the Wayback Machine could render prior archives of that website moot? (Because the Wayback Machine starts from scratch each time?)

Yeah, it doesn't make a lot of sense to me.

They say they "declined to take down the archives"-- but they didn't in fact do this at all, they just insisted a request to take down the archives come in the form of a robots.txt, and they automatically and without review comply with all such requests in the form of a robots.txt. They don't in fact ever decline to take down any archives, if the request is properly given as a robots.txt.

I don't know why they bothered making statements about "declining to take down the archives" in the first place (to the journalist or to us), making comments about "Reid’s being a journalist (a very high-profile one, at that) and the journalistic nature of the blog archive" -- they did not in fact "decline to take down the archives" at all. The "journalistic nature of the archive" was in fact irrelevant. They took em down. They are down.

There may be one slight (and practically irrelevant) difference: the blog posts are still archived by the Wayback Machine, but it refuses to serve them due to the robots.txt file.

If that file were to be removed, presumably the archive would again be served up upon request.

The lawyers were asking for the archive store itself to be wiped.

Yeah, I've come to appreciate that their frustrating robots.txt policy is actually a wise way for them to avoid chipping away at the complete archive. Obviously they have a mission that stretches far beyond the lifespan of today's internet users and the relevance of robots.txt files.

My interpretation of what happened is that Ms. Reid's lawyers requested that specific posts within a full archive be removed from the archive. In other words, they weren't asking for removal of the entire archive. They just wanted a "sanitized" version to be accessible to the public.

The Internet Archive has a mechanism for doing this, as I understand it. It involves asserting copyright over the material in question and essentially "making a case" for removal. IA decided the case they made didn't pass muster, and denied specific removal on those grounds, which is why they mention "journalistic nature of the archive" and so forth.

But that's entirely orthogonal to their policy of treating active maintenance of robots.txt as indicative of positive copyright assertion over the contents of an entire domain -- which Ms. Reid's team appears to have taken as a fallback position. They couldn't get the sanitized archive they wanted, so they just made the whole thing invisible.

Why couldn't they get the sanitized archive they wanted with robots.txt, but just excluding the specific URLs they wanted excluded in robots.txt?

If that's the case you could buy up defunct domains, exclude everything via robots.txt and selectively purge sites from archive.org

That seems like a pretty glaring flaw in something designed to create an enduring record.

For reference I've submitted a post on the Danish net archives https://news.ycombinator.com/item?id=16919264 which by law will archive all of any 'Danish' site and ignore robots.txt exclusions.

from the faq: http://netarkivet.dk/in-english/faq/#anchor8

8. Do you respect robots.txt? No, we do not. When we collect the Danish part of the internet we ignore the so-called robots.txt directives. Studies from 2003-2004 showed that many of the truly important web sites (e.g. news media, political parties) had very stringent robots.txt directives. If we follow these directives, very little or nothing at all will be archived from those websites. Therefore, ignoring robots.txt is explicitly mentioned in the commentary to the law as being necessary in order to collect all relevant material

I wonder if there are any other national archives of the internet that do the same.

The relatively clear mandate for collecting all published materials given to the Danish Royal Library by the Danish constitution is kind of rare in its scope and status.

https://www.bl.uk/collection-guides/uk-web-archive describes the much more limited approach taken by the British Library much later in time, but it might extend to a similar scope.

That web domain dataset they get from the internet archive is interesting in light of the current discussion, in that I am supposing it probably has .uk content that has been removed from the actual internet archive by robots.txt changes.

I think if I were running a national or internationally mandated archiving initiative I would basically want to take in content from Internet Archive, and not remove things, and probably it would be less expensive that way than having my own crawler.

It's actually very clever.

The key is it works both ways. By respecting the live robots.txt, and only the live one, data hiding must be an active process requested on an ongoing basis by a live entity. As soon as the entity goes defunct, any previously scraped data is automatically republished. Thus archive.org is protected from lawsuit by any extant organisation, yet in the long run still archives everything it reasonably can.

They seem to have had this problem in the past and decided to skirt around it by ignoring robots.txt a year ago[0]. Does anybody know what happened to revert this decision?


As per that post, they only ignored robots.txt for .gov and .mil sites.

IA disallow in robots.txt will still block archive.org; the blog post was about ignoring parts that were meant for search engines.

Yes, but it also says

> We are now looking to do this more broadly.

That's the part I'm asking about.

Right, but it doesn't mean they reverted it, they are probably still looking into it.

I don't think they purge the archives, I think they just don't serve them on the wayback machine.

Yes. Instead of deleting anything, I think the Archive tends to mark stuff as "do not show this for a few decades."

Do you have a source for this? I didn’t know that but it’s very interesting – a good compromise between the interests of current website owners[1] and future historians.

[1]: Sure, some people just want to hide embarrassing or incriminating content, but there’s also cases where someone is being stalked or harassed based on things they shared online, and hiding those things from Archive users may mitigate that.

Generally when items are "taken down" from the Internet Archive, they just stop being published, and are not deleted.

I don't think it's mentioned in an official document, but it's usually referred to as "darking".

It's probably safe to assume that the same concept applies to the Wayback Machine as to the rest of IA.

Edit: Here's a page that indirectly conveys some information about it: https://archive.org/details/IA_books_QA_codes

I thought I'd read it in a blog post by Jason Scott at textfiles.com, but I couldn't find a reference quickly. It could have come from conversation, as I've visited the Internet Archive a few times.

Which won't be legal, it seems, if they serve EU-based users and keep PII on any EU users.

Yup, that already happened in the past. For some reason, this is apparently a feature, not a bug.

From what I understand, this is the deal they make to avoid getting sued by everyone for copyright violation.

It is a glaring flaw. It's meant a lot of sites ended up wiped out of the archive (or at least made inaccessible) simply because their domain has expired and the domain squatters blocked the empty domain from being indexed.

The solution (which the Internet Archive really needs to implement) is to look at the domain registration data or something, and then only remove content if the same owner updated the robots.txt file. If not, then just disallow archiving any new content, since the new domain owner usually has no right to decide what happens to the old site content.

I don’t read it like that. I think it’s ‘We can no longer crawl and archive the site since robots.txt changed to exclude us. Archives from before that exclusion are still available (though not publicly.)’

Anything otherwise obviates the entire mission of the Wayback Machine.

Fighting a bunch of court cases to keep perfect purity in their principles would cost too much, when the number of these exclusions is quite small. Better to provide an easy out to a few whiny people like this author, than to spend all their money fighting every edge case on principle and in so doing bail on the entire mission of the organization.

>Fighting a bunch of court cases to keep perfect purity in their principles would cost too much, when the number of these exclusions is quite small. Better to provide an easy out to a few whiny people like this author, than to spend all their money fighting every edge case on principle and in so doing bail on the entire mission of the organization.

She's a prominent left-wing television personality. Would you be so accommodating if it were, say, Tucker Carlson trying to scrub embarrassing information about himself from the wayback machine?


> (Because the Wayback Machine starts from scratch each time?)

I don't know what "start from scratch" would mean – the point is that each site is sampled many times throughout history. That said, it is very odd that a current change in robots.txt would prevent looking at old samples. And that's indeed what it looks like [1]:

> Page cannot be displayed due to robots.txt.

[1] https://web.archive.org/web/*/blog.reidreport.com

I'd imagine they're in a dodgy copyright situation and so guard against it by being conservative wrt robots.txt.

The robots.txt shows a positive assertion that parts of a site should be excluded from being used by automated systems.

In most cases I imagine WBM does not have permission of the owner to keep a duplicate of the site; it's certainly tortious in UK law.

Sites that don't change their robots.txt are probably highly correlated with sites that don't sue for the infringement.

In fact they used to document this behavior at https://web.archive.org/web/20150411133228/https://archive.o... but they removed the documentation in mid-2015:

> To remove your site from the Wayback Machine, place a robots.txt file at the top level of your site (e.g. www.yourdomain.com/robots.txt).
>
> The robots.txt file will do two things:
>
> 1. It will remove documents from your domain from the Wayback Machine.
> 2. It will tell us not to crawl your site in the future.
>
> To exclude the Internet Archive’s crawler (and remove documents from the Wayback Machine) while allowing all other robots to crawl your site, your robots.txt file should say:
>
>     User-agent: ia_archiver
>     Disallow: /

Their current documentation no longer says that they stop displaying old archives automatically in the presence of an ia_archiver Disallow directive, but I have not experimented about whether they still actually do this anyway.
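For anyone wanting to sanity-check what those two lines mean under ordinary robots.txt matching, Python's standard `urllib.robotparser` can evaluate them. This only shows the standard crawl-permission semantics. It says nothing about whether the Wayback Machine hides old captures, and `SomeOtherBot` and the page URL are made-up examples.

```python
from urllib.robotparser import RobotFileParser

rules = RobotFileParser()
rules.parse([
    "User-agent: ia_archiver",
    "Disallow: /",
])

page = "http://www.yourdomain.com/some/post.html"  # hypothetical page on the site

print(rules.can_fetch("ia_archiver", page))   # False: the Internet Archive's crawler is excluded
print(rules.can_fetch("SomeOtherBot", page))  # True: no rule applies to other robots
```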

Just guessing wildly about similar issues, I know that news organizations which have a publicly documented "unpublish" policy tend to get that policy used aggressively by reputation management firms and the like.

My understanding of their robots.txt policy is that it only removes the public archive of the site and they still have a private archive. That said, I'm working from memory here and I wasn't able to find anything in their FAQ to confirm this, so it's possible I've misremembered something, though I did find some forum posts saying that formerly-available archives return a 403 instead of a 404, which might provide some evidence for this.

Because the Wayback Machine starts from scratch each time?

I don’t think that’s it...it’s not a technical thing. Deleting all archives must be a courtesy they extend to anyone that specifically denies access to Wayback Machine in their robots.txt. Does anyone know if this is documented? If so, why didn’t her lawyers just carry out the robots.txt technique and not even bother contacting them? Most importantly, why would they have such a policy? This is all very odd.

Do we know that the archive was deleted or merely taken offline? The blog post says "excluded", which to me implies that they still have the data; it's just not publicly available.

I hang around in emulation circles and there's been some talk in the past few weeks because some Nintendo ROM archives had been taken offline from archive.org but people soon figured out that they could still access them by tinkering with the URL. The situation is a bit different here though.

I assume it isn't deleted...I misspoke. But the net result for the public is the same.

Come on, Joy. Your blog posts weren't even that bad, they were just in poor taste. You didn't say anything particularly vitriolic or hateful. This is your opportunity to admit that these were once your views, and emphasize your personal growth since then.

Instead, you are just going to pretend that your past self never existed...

"I find gay sex to be gross" isn't that controversial of an opinion. Plenty of open-minded, accepting people agree with you. It just wasn't a worthwhile opinion to espouse...

Own it, Joy. Don't just play dumb, because now you just look dumb.

>Your blog posts weren't even that bad

Some of them crossed the line and were no longer merely disappointing.

>claims that gay men prey on “impressionable teens”


You're not wrong. Her blog is tame compared to some of the comments I see DAILY in ANY subreddit on reddit.

What Joy SHOULD have done was admitted that she "used" to be a homophone, and apologized.

She's still a homophone; Reid sounds exactly like "read".

Isn't that the point though? She wasn't a homophone in the past.

"homophone" - each of two or more words having the same pronunciation but different meanings, origins, or spelling, e.g., new and knew.

"homophobe" - a person with an extreme and irrational aversion to homosexuality and homosexual people.

Just a little joke about a linguistic mixup.

:) Might just be my pronunciation, but Reid only sounds like the present form.

Is the use of the word "homophone" some in-joke that I'm not aware of?

I agree.

The better narrative would be to say:

"This is who I used to be, there were many people like me at the time, but my views have evolved and I've become close friends with many of the people that my words hurt. I'm sorry and I work every day to make up for these mistakes."

Actions speak louder than words.

Wait so adding a robots.txt exclusion for the Wayback Machine makes all previous archives of the site inaccessible? That's very odd behaviour, and really not the point of a robots.txt file... I would expect a robots.txt to control a bot's visits / scraping behaviour, not a site's history.

Yes. They use the robots.txt file to essentially ascertain ownership or control of the domain. Archive.org doesn't want to delete the content they have (for whatever reason), so the compromise they came up with is to read the robots.txt and then hide the content they have archived if the present domain owner/controller wants it to be that way.

If you remove the robots.txt setting, the archives become available again.

So if you find something you'd better make a copy of it yourself because it might be going dark. Doesn't that kind of defeat the whole purpose of the Wayback Machine?

So I try to make a copy of any interesting web pages on archive.is these days.


This one is funny, because conservatives have used archive.is for some time to archive and mock left-leaning websites and some of them blocked archive.is in the past and still block archive.is today.

VOX for example returns a 0-sized page for archive.is. In the past VICE returned 404s to archive.is https://i.imgur.com/OnFdVpS.jpg

What I mean to say is that these services are useful but they are not faultless.

VICE didn't just block archive.is, they blocked the Internet Archive too by returning the exact same 404 page. They really didn't want any archived copies of their posts hanging around anywhere outside their control.

Why are so many irrelevant political left-vs-right "he said, she said" type comments popping up on HN just lately?

I personally think this is simply a result of how much harder the media is pushing that divide (for all of their various purposes). I actually spent some time last year researching this, because I thought I might have just become an old man thinking how great things used to be. I started reading old news stories fairly randomly, from the present time all the way back to the Vietnam era (and a few rabbit holes to earlier times). The first thing that surprised me was the amount of link rot that exists. I always knew intellectually that it was a problem, but wow. It's bad. The second thing that I found was that indeed, the media hammers on the "us-versus-them" political divide of American politics much, much harder nowadays than even ten years ago. I think Fox News was really the turning point. It opened the flood gates. I always remember thinking how "extreme" Fox News was, but I challenge anyone to look up a few of their older stories from the middle of the last decade. It's child's play compared to what pretty much every media outlet is doing today. You can hardly read a recent news story from just about anywhere without being told how it's supposed to fit into our political worldview, and how we should feel about it, and why it's good/bad/stupid/amazing/"terrifying". And so of course, because of this, people are just responding to the programming. Creating the world they're led to believe they live in. I think it really is that straightforward.

Did you just look at print? Talk radio has been hammering this since the late eighties. Hell you can probably draw a line straight from the "Moral Majority" shit in the seventies, to where we find ourselves now. I suspect this has always been a big part of American culture, but it's being magnified now either by new tech or malicious actors or both.

Oh, you know, that's interesting. I hadn't even thought about talk radio, but you're absolutely right.

Do you have a selection of those old stories? It would be interesting.

Ah, I apologize, I didn't keep notes or save links or anything of the sort, and I keep kicking myself for it. I'm usually pretty good about taking notes just out of my regular habit of doing research, but it was so casual, and I didn't think it would end up taking as much of my time as it did. It's a pretty easy formula to replicate, though. I picked current events that I could remember -- intervention in Kosovo, Bill Clinton's sex scandal, Berlin Wall, first election of Putin, Enron scandal, those kinds of things -- and just started looking up stories, and asked my parents and older friends to help me with events I wasn't old enough to remember before the 80s. I made sure to hit a "good" cross-section of the media outlets of the day.

I'd love to see the Bush/Gore 2000 election play out on social media. I was only a kid but the news coverage seemed pretty mild compared to how I imagine it would be if that happened in 2016.

Thanks - it seems like a half day project to build a spider for this ... one for the list :-)

Everything is partisan political now. What books you read, what films and TV series you watch, where you live, the definition of "political" itself, and to some extent even what internet archiving service you use. (In reality, left-leaning folks have used archive.is to save and mock conservative sites for some time too, but even though this happens across the board it's still normal and expected to think of this as a partisan political activity because everything is now.)

The people that used to hang around /r/incel are now spending their time elsewhere on the net.


Mentioning that certain publications block archiving is not irrelevant.

I think having written short sighted things and then regretting them is a somewhat universal thing. I also don't have a problem with either side using previous writings, as long as they are reproduced accurately and faithfully.

In short, before you publish a blog post that is sexist/racist/homophobic/whatever, consider that even if you delete it, others may have a copy and will use it against you.

How you maintain cognitive dissonance in defending such personal blog sites masquerading as news outlets despite admitting that a word-for-word reproduction of their words constitutes mockery is beyond me.

Well, in this case it is a left-leaning activist who is taking this other left-leaning activist down. It is not always the politics we expect, but we can guarantee that if it's political it will be nasty.

The issue I have is that we should not be able to just block access to archived content because it's embarrassing.

Yes, but it's also always been true that if you want to keep something for reference, you make your own copy.

The problem with that though is that people will think you've manipulated your copy. I've had people accuse me of this when I save pages with screenshots. You need to have a trusted third party make and store the copies.

Sometimes I use archive.is. They don't automatically delete because of robots.txt, but it's not fully clear to me when they do delete things.

One method for making a copy or crawl very difficult to tamper with is to publish a hash somewhere difficult to forge (e.g., in a national newspaper or OpenTimestamps). That won't prove the copy wasn't manipulated before it was archived, though. For that, we would need multiple, independent archives.
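To make the "publish a hash" idea concrete, here is a minimal sketch using SHA-256 from Python's standard library. The page content is a made-up placeholder and the function name `fingerprint` is just for illustration; where the digest actually gets published (newspaper, timestamping service) is outside what the code shows.

```python
import hashlib

def fingerprint(page_bytes):
    """Return the SHA-256 digest of an archived page as a hex string."""
    return hashlib.sha256(page_bytes).hexdigest()

# Hypothetical saved copy; in practice this would be the bytes of the capture.
saved_copy = b"<html><body>Original post text</body></html>"

# This short string is what would be published somewhere hard to forge.
published_digest = fingerprint(saved_copy)

# Later, anyone holding a copy can recompute the digest and compare.
unchanged = fingerprint(saved_copy) == published_digest
altered = fingerprint(saved_copy.replace(b"Original", b"Altered")) == published_digest
print(unchanged, altered)
```

The digest only pins down the bytes as they were when hashed, so it says nothing about whether the page was already manipulated before the copy was made.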

This is effectively what libraries have been doing for many years with their archives of newspapers.

The hashing needs to be done by a trusted third party. Such a service would be cheaper to operate than the Wayback Machine, but would let you check an individual's content against manipulation.

You have to incentivize people running and storing the hashes.

You could put it in the bitcoin blockchain. Or if you don't need that level of complexity and cost, you could put it on twitter, which doesn't allow editing tweets (but does allow deleting).

This would be entirely overkill, but in theory if you're accessing the site via https couldn't you record the conversation from your end and later prove it?

Edit: thought about this some more, I don't think this would work since in ssl iirc you agree on a symmetric encryption key which is then used to encrypt the rest of the request response cycle.

You'd need a proven time stamp on that conversation, or else the site could just switch certs and then leak their keys. Then, they can claim that you forged the traffic using the leaked keys.

I can't believe I'm the one to propose this but being able to unforgeably timestamp some data is one of the only actual use cases of blockchains. Archive your data, compute a checksum and store that in a Bitcoin block and you can prove later on that you actually owned the data at that point.

Of course there are other ways to achieve that, such as publishing your checksum to a vast number of neutral third parties, for example in a mailing list, BitTorrent or even a newspaper. You could also rely on a trusted third party who has low incentives to manipulate the data (or would risk a lot if they were caught cheating), such as a bank, insurer or notary.

I think archive.org could potentially do something like that by using a merkle tree containing the checksums of the various pages they've archived during the day and publish the top hash every day for anybody to archive (or publish it on a blockchain or whatever, as said above). If later on somebody accuses the archive of manipulation they can publish the merkle tree of the day the site was archived which contains the checksum of the page, and anybody having the top hash of that day can vouch that it is indeed correct.

It doesn't stop the archive from storing bogus data but it makes it impossible to "change the past" post-facto, so in this particular situation the journalist could only complain that the archive stored bogus data back in the day and not that it was recently tampered with.
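A rough sketch of the daily top hash described above, in plain Python. The page contents are placeholders, the `0x00`/`0x01` prefix bytes are one common way to keep leaf and inner hashes distinct, and pairing an odd node with itself is a simplification; none of this is claimed to be what archive.org does.

```python
import hashlib

def h(data):
    return hashlib.sha256(data).digest()

def merkle_root(page_checksums):
    """Fold a list of per-page checksums (bytes) into a single top hash."""
    if not page_checksums:
        raise ValueError("need at least one checksum")
    # Prefix bytes keep leaf hashes and inner-node hashes from colliding.
    level = [h(b"\x00" + c) for c in page_checksums]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # simplification: pair an odd node with itself
        level = [h(b"\x01" + level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
    return level[0]

# Hypothetical checksums for three pages archived on the same day.
day = [h(b"page one"), h(b"page two"), h(b"page three")]
top_hash = merkle_root(day)

# Changing any archived page changes the top hash that was published.
tampered_day = [day[0], h(b"page two, edited later"), day[2]]
print(top_hash.hex())
print(merkle_root(tampered_day) == top_hash)
```

Only `top_hash` (32 bytes per day) needs to be mirrored by outsiders; the full list of checksums stays with the archive until someone disputes a page.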

I immediately had the same idea to use a third party to host checksums, surprised they haven't done this. Blockchain makes a lot of sense from the immutability standpoint, but how would you incentivize people to maintain it? Maybe you can get people to do that for the common good à la Wikipedia? Not sure about that. Maybe you get Apache to bake it into their webserver to ask people to opt-in to dedicate 0.1% of resources to the cause?

I was thinking about using an existing blockchain such as Bitcoin. Of course then the inconvenience is that archive.org would have to pay a fee every time they submit a new hash. A comment above pointed out that the scheme I described (unsurprisingly) already exists at https://petertodd.org/2016/opentimestamps-announcement

Realistically it might be overkill though. Simply setting up a mailing list where anybody can subscribe and be sent the checksum every day, or even just publishing it at some URL and letting users scrape it if they want, might be sufficient. If we're talking about one checksum every day it's only a few kilobytes every year, so it shouldn't be too difficult to convince a few hundred people and organizations around the world to mirror it.

I think you need a checksum for every page, not every day. How would you independently verify the checksum for an entire day?

Merkle trees. Whoever wants to store a timestamp for some message stores the path from that message to the root of the Merkle tree. Only the root of the Merkle tree of each day needs to be published.
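Here is a small sketch of what "storing the path" buys you, assuming a four-message day built by hand. The message contents, the `verify_path` name and the `0x00`/`0x01` prefix convention are all illustrative assumptions rather than any particular service's format.

```python
import hashlib

def h(data):
    return hashlib.sha256(data).digest()

def verify_path(message, path, root):
    """Recompute the root from a message and its sibling hashes.

    path is a list of (sibling_hash, sibling_is_left) pairs, leaf level first.
    """
    node = h(b"\x00" + message)
    for sibling, sibling_is_left in path:
        if sibling_is_left:
            node = h(b"\x01" + sibling + node)
        else:
            node = h(b"\x01" + node + sibling)
    return node == root

# Build a tiny tree by hand for four hypothetical messages.
messages = [b"page-a", b"page-b", b"page-c", b"page-d"]
leaves = [h(b"\x00" + m) for m in messages]
left = h(b"\x01" + leaves[0] + leaves[1])
right = h(b"\x01" + leaves[2] + leaves[3])
root = h(b"\x01" + left + right)  # the only value that is published

# Whoever cares about page-c keeps just these two hashes.
path_for_c = [(leaves[3], False), (left, True)]
ok = verify_path(b"page-c", path_for_c, root)
bad = verify_path(b"page-c, edited", path_for_c, root)
print(ok, bad)
```

The stored path grows with the logarithm of the number of messages, so even a day with millions of pages needs only a couple dozen hashes per proof.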

No. HTTPS doesn't provide non-repudiation.

HTTPS does not sign the content. It MACs the content.
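The difference can be shown with a few lines of Python's standard `hmac` module. This is only an illustration of the shared-key property, not of how TLS actually protects records (which is more involved); the key and messages are invented.

```python
import hashlib
import hmac

# In a MAC scheme both ends of the connection hold the same secret key.
shared_key = b"session key known to both client and server"

def tag(key, message):
    return hmac.new(key, message, hashlib.sha256).digest()

def check(key, message, mac):
    return hmac.compare_digest(tag(key, message), mac)

# What the server really sent, with its authentication tag.
server_message = b"<html>what the server actually sent</html>"
server_tag = tag(shared_key, server_message)

# The client knows the key too, so it can tag content the server never sent.
forged_message = b"<html>something the server never sent</html>"
forged_tag = tag(shared_key, forged_message)

genuine_ok = check(shared_key, server_message, server_tag)
forged_ok = check(shared_key, forged_message, forged_tag)
print(genuine_ok, forged_ok)
```

Both checks pass, so a recorded session cannot convince a third party who wrote the content. A signature made with a private key that only the server holds would not have this problem.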

Yup! This is the problem. There’s a blog post out there by a security guy titled “nothing is real” where he covers how nobody can truly trust the data they get from computers.

Do you think you can find it? Sounds interesting.

I found my tweet about it:


As an aside, I’ve been noticing Google has been getting worse and worse at finding something I’m sure is out there. I’m not saying that they’re necessarily getting worse, but maybe it’s getting harder to deal with the sheer scale of the web these days.

It's good enough to combat link rot if you need to or want to refer back to or read something again. Nothing on the internet is permanent and even if it is still out there, that's no guarantee that Google or any other search engine still has it in their index. At least that way the information isn't lost to you.

This highlights to me something about long-term management of my domains. "blog.reidreport.com" is now run by some domain squatter - and knowing nothing about Reid or reid report I took a vaguely generic website at face value - and clicked on the heavily disguised paid adverts.

Clearly her domain is defunct - but I got suckered and actually came here to say things like "what terrible journalistic standards" before double checking.

As my old domains fall into disrepair I guess I will need to archive them to S3 and keep up the payments just to stop this happening.

An interesting problem - and possibly a revenue source for archive.org?

EDIT: Hang on - the article on archive says (someone) added a robots.txt to block them. But the blog.reidreport.com is parked on some crappy redirect thing.

Whois says that joyannreird@gmail.com still owns the domain - so I think she has got some very very bad advice from her hosting company. And my point still stands - a domain name is a reputation, and it is for life, not just for christmas.

Are you sure you're spelling the URL correctly? `blog.reidreport.com` (as you spelled it in your post) redirects to a Blogger.com "Permission denied" page. Not a squatter.

I think I'd be concerned about your client redirecting you to a squatter page.

That was... vague

Why would the robots file on an active site be applied to the archived content?

Legal issues. Who owns the content? There is no real legal basis for a right to mirror, despite how it feels from a techy point of view.

Fair use. It’s for documentation, educational, and research purposes.

There are limits on the amount of content you can redistribute under "fair use" for a given purpose. I'm not sure if redistributing half of the internet would be legally justifiable.

It’s certainly a good Supreme Court case. EFF/ACLU should be able to cover this case.

Google’s distributing more content than archive.is.

That's technically an affirmative defense, rather than a right.

Lawsuits are spendy. TIA have Streisanded the issue.

Robots.txt now has legal status?

It undoubtedly communicates some intention by the site owner about how the work on the site can be used, so probably?

It only communicates the intention if the reader is actually obliged to read and interpret the file as a legal document.

If I put a file named "legal.txt" in an online folder, is anyone required to read it and act upon it? It might as well be a file intended for some completely unrelated purpose; e.g. a lawyer that put some drafts online, or for all the reader knows, it might even be part of a movie script.

robots.txt has been a de facto standard for over 20 years. Someone might be able to claim ignorance, but the Internet Archive has shown that they know about it. It has a specific format; if it can be parsed, it's safe to assume that it isn't part of a movie script.

In most cases, copyright law requires the reader of a document not to republish it, so the robots.txt standard is actually much more permissive.

It represents the expressed preferences of the website owner, so it is legally relevant.

I agree. That sort of defeats the purpose of something that is literally meant to archive

TeMPOraL had the most plausible explanation up thread a bit - apparently it's there as a copyright claim safety-net for Wayback Machine, but making a feature that runs counter to the whole point of the site publicly available makes my head hurt.

Perhaps having a bot generate a synopsis of removed content, and showing that in its place would solve any copyright issue fairly elegantly?

Generating a synopsis might have other implications (e.g. accusations of libel if the original author considers the synopsis attributes false claims to them).

A synopsis is a derivative work. As such, it falls within copyright laws.

Huh, I'd always thought they fell under fair use in the same way a review does.

You learn something new every day... Multiple times a day in my case!

Pragmatic decision to avoid big stinks that might risk the extremely high ratio of sites that _don't_ currently block IA.

Agreed, it doesn't make any sense to apply those rules retroactively.

Applying the various razors, I find that the hypothesis that would need to be refuted first is that Reid wrote the posts, they were archived as written, unaltered since then, and she simply does not wish to take responsibility for their contents now.

Who had motive to alter the posts in question? Who had the opportunity? When could it have happened? What method did they use to do so?

If Reid's team cannot plausibly answer those questions, we are still examining the simplest hypothesis, and have seen no plausible evidence that it should be refuted.

If we are to believe that those posts were written by someone else posing as Reid, would that suspicion not apply equally to everything appearing on her blog now? In which case, the solution has always been to sign the post using public-private asymmetric cryptography and to employ a public timestamp server to verify the time of publication.

The robots.txt exclusion loophole has been known for quite a long time.

>The robots.txt exclusion loophole has been known for quite a long time.

Yes, but it seemed like they had changed their mind, exactly because there is a huge issue with "expired" domains, see:



They experimentally ignored robots.txt on .mil and .gov domains, and I thought they were going to extend this new policy for all archived sites.

The situation/status is not clear, though the retroactive validity of robots.txt remains (at least to me) absurd.

It is IMHO only fair to respect a robots.txt from the date it was put online; it is the retroactivity that is perplexing. As a matter of fact, I see it as violating the decisions of the Author, who, at the time the content was made available, expressed by not posting a robots.txt the intention to have the content archived and accessible, while there is no guarantee whatsoever that the robots.txt posted years later is still an expression of the same Author.

Most probably a middle way would be - if possible technically - that the robots.txt is respected only for the period in which the site has the same owner/registrar, but for the large amount of sites with anonymous or "by proxy" ownership that could not possibly work.

.gov and .mil sites are presumably public domain anyway because they're US government. Therefore, it makes sense to ignore instructions not to archive them.

In pretty much all other cases--except where they were public domain or CC0--it's probably not strictly legal to archive them at all. Therefore, it makes sense to bend over backwards to remove any material if asked to programmatically or otherwise.

>I see it as violating the decisions of the Author

Maybe in some cases. But, for better or worse, preventing crawling is opt-in rather than opt-out, and defaults are very powerful. "You didn't explicitly tell me that you didn't want me to repurpose your copyrighted material" isn't a very strong legal argument.

I'm guessing retroactively respecting robots.txt is a political decision to protect the integrity of their archives. "Here's an easy way to automatically remove your stuff from the publicly accessible archives" prevents a lot of lawsuits and potentially bad press. It's annoying if it blocks a few sites, but better to quickly and quietly block a few than to generate so much noise that a bunch more get blocked. As a pragmatic strategy it's probably the best one for preserving the widest array of publicly accessible archives.

Is it reasonably feasible to extend the syntax of robots.txt to include date ranges when the entries are specific to the IA bot? That way, specific content from a certain time span could be retroactively suppressed if desired.

This would also solve situations where a new owner blocks robot access for a domain where the former owner is OK with the existence of the archived site.

Why allow retroactive suppression at all?

It seems to make the most sense to only have a robots.txt affect pages archived when that specific version of robots.txt is in effect.
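As a sketch of that rule: if the archive kept each robots.txt version together with the date it was first seen, visibility of a snapshot could be decided from the version in effect at capture time. Everything here (the `snapshot_visible` helper, the history format, the dates and the example.com URL) is hypothetical and not how the Wayback Machine is known to work.

```python
from bisect import bisect_right
from datetime import date
from urllib.robotparser import RobotFileParser

def allowed(robots_txt, agent, url):
    """Return True if this robots.txt version lets agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)

def snapshot_visible(history, captured_on, url, agent="ia_archiver"):
    """history: list of (first_seen_date, robots_txt), sorted by date."""
    dates = [first_seen for first_seen, _ in history]
    i = bisect_right(dates, captured_on) - 1
    if i < 0:
        return True  # no robots.txt had been seen yet at capture time
    return allowed(history[i][1], agent, url)

# Hypothetical history: an empty robots.txt, later replaced by a full block.
history = [
    (date(2010, 1, 1), ""),
    (date(2015, 1, 1), "User-agent: ia_archiver\nDisallow: /\n"),
]
url = "http://example.com/post.html"
old_visible = snapshot_visible(history, date(2012, 6, 1), url)
new_visible = snapshot_visible(history, date(2016, 6, 1), url)
print(old_visible, new_visible)
```

Under this rule the 2012 capture stays visible even after the 2015 block appears, while captures from the blocked period are hidden.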

Perhaps this article[0] may provide you with insight into the motivations of those who may prefer to suppress historical data.

0: https://en.wikipedia.org/wiki/Right_to_be_forgotten

There is no need to extend the syntax of robots.txt.

At the time of crawling the robots.txt is parsed anyway.

If it excludes part or the whole site from crawling, it should IMHO be respected, and the crawl of that day should be stopped (and pages NOT even archived); if it doesn't, then crawling and archiving them is "fair".

The point here is that by adding a "new" robots.txt the "previously archived" pages (that remain archived) are not anymore displayed by the Wayback Machine.

It is only a political/legal (and unilateral) decision by the good people at the archive.org, it could be changed any time, at their discretion, without the need of any "new" syntax for robots.txt.

I think that enabling the user to selectively suppress parts of the site for certain archived time spans is a better solution. Sometimes, a page might have been in temporary violation of a law or contract and that version needs to be suppressed. But that particular case does not mean that any other version needs to be hidden as well.

>I think that enabling the user to selectively suppress parts of the site for certain archived time spans is a better solution.

But that is easily achieved by politely asking the good people at archive.org, they won't normally decline a "reasonable" request to suppress this or that page access.

As a side note, there is something that (when it comes to the internet) really escapes me. In the "real" world, before everything was digital, you had lawful means to get a retraction in case of, say, libel, but you weren't allowed to retroactively change history by destroying all written memories, and attempts like burning books in public squares weren't much appreciated. I don't really see why going digital should be so different.

I guess that the meaning of "publish" in the sense of "making public by printing it" has been altered by the common presence of the "undo" button.

Another sign I am getting old (and grumpy), I know.

Well, you are missing the part where producing and selling new copies of e.g. libelous works can be forbidden in the real world. So the old copies will still be around, but they have to be passed on privately. Effectively, this takes affected works out of circulation.

No, actually it was exactly the example I made, in the case of libel someone with a recognized authority (a Court) can seize/impound the libelous material and prohibit further publication (and of course not destroy each and every copy in the wild), but the procedure is very different from someone (remember not necessarily the actual Author, actually only the owner of the domain/site at a given moment) being able to prevent access to archived material published in the past (material that does not represent a libel and is not violating any Law) only because he/she can.

A middle way could be to only observe robots.txt for crawling, and not for displaying pages. So once a page is grabbed, it's available forever. But if a page is covered by a robots.txt exclusion, it won't be crawled.

So who has the old version of the blog posts in question, so we can see what this journo has to hide?

This is a good blog post to link to those that actually believe Joy Reid in this case


This, right here, is why the GDPR's 'right to be forgotten' is so pernicious. Were Mrs. Reid an EU person, the Internet Archive could be forced to disappear her previous posts. Or given that she's also a public figure, would it be permitted to retain them? Only a court could decide.

I don't see this ending well for her. She should have just come clean and apologized ... again.

IPFS wayback machine when

When it scales multiple orders of magnitude better? Web archives are massive and have non-trivial storage costs.

What would be interesting would just be storing a Merkle tree for archive hashes so many parties could verify that a much smaller number of copies haven’t been modified.

A bit surprised at the comments here. As far as I can remember, robots.txt has always been used by the Wayback Machine this way. Often a bummer when the original domain expires and a domain squatter takes over - many have robots.txt

That's changed.

A few months ago we stopped referring to robots.txt files on U.S. government and military web sites for both crawling and displaying web pages (though we respond to removal requests sent to info@archive.org). As we have moved towards broader access it has not caused problems, which we take as a good sign. We are now looking to do this more broadly.

We see the future of web archiving relying less on robots.txt file declarations geared toward search engines, and more on representing the web as it really was, and is, from a user’s perspective.



Bottom line: You can remove the content you own using robots.txt

Signed hashes with blockchain time stamping could be a solution to that. A standard format for content and publishing content (whether a blogpost or a tweet), then based on that block of text you provide a digital signature linked to your identity. The signature is published on the blockchain, though, this is optional.

The author will then be unable to refute (or able to) the authorship of the relevant text.

What's the GDPR impact on this?

If I have my own blog on my own domain, and Google and Wayback Machine archives it, can I request them to delete it one year later under GDPR?

In general you don’t need GDPR for that. It suffices to be a copyright holder to the content made accessible on the Wayback Machine.

You can already do this, by including an entry in your `robots.txt`.

Yes, unless it is in the interest of the public to keep it up [0].

[0] https://www.theguardian.com/technology/2018/apr/13/google-lo...

>can I request them to delete it one year later under GDPR

You can request they delete any personal/identifiable data. So any references to names, email, pictures of you etc but probably not content.

Wayback Machine does respect robots.txt, I think. I'm curious what happens if you lose control of your domain, though.

If you lose control the whole history of the site gets nuked by a new robots.txt. This has happened in a few notable cases fairly recently.

I guess so if that blog identifies you in some way, which you can probably argue it does.

PII is identifiable with you and personal. Though practically I can't see Google checking if particular pages have PII rather than just regular information.

Almost all my blogging is about tech, not PII at all.

I believe it.

>>...we declined to take down the archives.

OK, the way I read it, the author--one way or another--asked for her blog not to be hosted by the Wayback Machine and they declined. It's my work; as long as I can verify that I wrote it, they should take it down or be sued for copyright infringement.

I get the "we're archiving the internet," but if I want that post where I said Google is evil taken down because I have a G job interview a week from now, they should take it down. Another thing, just because I have a page online, doesn't mean that I gave them consent to archive it for eternity.

I get the robots.txt, but if you're archiving you should ask for permission; there are a gazillion robots out there.

Section 108 exception to copyright protection for publicly accessible archives: https://www.law.cornell.edu/uscode/text/17/108

That exception is for physical works within a library. It's basically the only thing that makes libraries/archives special relative to you or me with respect to copyright.

The Internet Archive would be completely unworkable if they had to ask for permission first. So would the Internet for that matter. Anyway, the whole point is to preserve things that wouldn't otherwise be preserved for the future. The Internet Archive folks are thinking in timespans of centuries, not months or years. The service they are providing to current and future researchers and historians is invaluable.

The author released content for public consumption. While the author has copyright, it’s not legal to revoke usage rights years after the fact. Not a lawyer, but I think it’s related to usage rights under a copyright.

Imagine an author giving you a copy of the book and then 15 years later coming to your home library and asking for it back.

It’s cool to not post publicly or to restrict access. But releasing and then yanking doesn’t make sense.
