The summary is that last year Reid (a journalist for a major US television news network) publicly apologized for a series of blog posts that were characterized as 'homophobic'. More posts of a similar nature were recently discovered on archive.org, and instead of apologizing for these as well, Reid has disavowed them. She and her lawyers claim that, unlike on the previous occasion, these newly discovered posts were altered by 'hackers' either before or after being archived. The linked blog post is making the limited claim that the posts on archive.org accurately represent the posts present on Reid's site at the time they were archived, and do not appear to have been altered post-archiving.
This might actually be a good use case for a blockchain. Hashing the data that's added to the archive and then putting the hash in the blockchain would reasonably prove the data in the archive hasn't been modified at a later date.
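To make that concrete, here's a minimal sketch (hashlib only; the publish step is a stand-in for whatever blockchain or other append-only medium you'd use):

    import hashlib

    def snapshot_digest(record_bytes: bytes) -> str:
        # Hash the capture exactly as stored, so any later modification
        # of the stored bytes changes the digest.
        return hashlib.sha256(record_bytes).hexdigest()

    capture = b"<html>...page as archived...</html>"
    digest = snapshot_digest(capture)
    # publish(digest)  # stand-in: anchor the digest somewhere append-only
    print(digest)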
It’s too bad all you can build with that is a meager, profitable SaaS business, not a wild speculative crypto-billionaire rocket ride.
Using a company called Catena:
Chunk-based hashing could work. Mapping the chunks to document structure (paragraphs, sentences, chapters, ...) might make more sense.
I'm familiar with RDA, FRBR, and WEMI, somewhat.
Here, have yourself a timestamp: https://petertodd.org/2016/opentimestamps-announcement
    pip3 install opentimestamps-client
    ots stamp myfile.txt
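ots stamp writes the proof to myfile.txt.ots; once the pending Bitcoin attestation has been upgraded, it can be checked with (if I recall the client's CLI correctly):

    ots verify myfile.txt.ots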
Unfortunately, the files themselves aren't public, and each file contains dumps from hundreds of websites, so even if they were public they're not the easiest thing to verify.
Still, being the guy behind OpenTimestamps I should point out that in this case I don't think timestamp proofs really add that much: Reid's claims seem dubious even without cryptographic proof.
Or they could just implement trusted timestamping (RFC 3161). Using a blockchain is a heavyweight solution and is rarely the right one.
You really need better auditing than that, which is why the certificate authority infrastructure now relies on a blockchain - Certificate Transparency - for auditing. Similarly, for timestamping specifically, Guardtime has used a blockchain for auditing their timestamps since well before blockchains got called blockchains.
Surely if content is served over HTTPS with a valid certificate, it should be possible to save (possibly as part of a WARC) a "signature" of the TCP stream that would prove not just that a web archive was created at a certain time, but also that it was served using that person's private key and thus came from that person's web server. To claim otherwise, the subject would have to claim that a fraudulent certificate was generated for their domain or that their web server was broken into.
Basically, the way the crypto math works in HTTPS, it's a symmetric proof that only establishes that either the sender or the receiver produced the TCP stream. Normally that's OK, because you trust yourself. But in this case the problem you're trying to solve is proving what happened to a third party who doesn't trust the receiver, so your idea doesn't work.
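A toy illustration of that symmetry, using plain HMAC (TLS's real key schedule is more involved, but the point is the same): both ends share the MAC key, so a recorded transcript can't show a third party which end produced it.

    import hashlib
    import hmac

    # Both client and server derive this same key during the handshake.
    shared_key = b"session key known to sender AND receiver"

    transcript = b"HTTP/1.1 200 OK\r\n\r\nthe disputed blog post"
    tag = hmac.new(shared_key, transcript, hashlib.sha256).digest()

    # The receiver can mint an equally valid tag over bytes the server
    # never sent, so the tag proves nothing about authorship to outsiders.
    forged_tag = hmac.new(shared_key, b"fabricated response", hashlib.sha256).digest()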
The certificate is used to sign (parts of) the values used to create the master secret. It doesn't sign anything after that.
1. The web server doesn't sign content; HTTPS keys just provide encryption.
2. The HTTPS connection operates under the web server's certificate; it'll serve any file without regard to who created it.
If you're not trying to get rich quick, however, something like a Merkle tree is a great fit, and it seems like there'd be value in a distributed system where trusted peers can vouch for either having seen the same content (even if they cannot distribute it due to copyright) or confirm that they saw you present a given object as having a certain hash at a specific time. Whether that's called a blockchain is a philosophical question, but I think it'd be a good step up over self-publishing hashes, since it'd avoid the need for people to know in advance what they'd like to archive.
To make that concrete, imagine if the web archiving space had some sort of distributed signature system like that. The first time the integrity of the Internet Archive is called into question, anyone on the internet who cared could check and see a provenance record something like this:
IA: URL x had SHA-512 y at time z
Library of Congress: URL x also had SHA-512 y at time [roughly z]
British Library: We didn't capture URL x around time z but we cross-signed the IA and LC manifests shortly after they crawled them and saw SHA-512 y
J. Random Volunteer Archivist: I also saw IA present that hash at that time
That'd give a high degree of confidence in many cases since these are automated systems and it'd lower the window where someone could modify data without getting caught, similar to how someone might be able to rewrite Git history without being noticed but only until someone else fetches the same repo.
(Disclaimer: I work at LC but not on web archiving and this comment is just my personal opinion on my own time)
That's what makes the blockchain useful: to change anything, you'd need to regenerate all the hashes after the point you want to modify, which is a lot more difficult. Having a proof that's generated by a network of parties (like a cryptocurrency) would add to the trust level, but it's not essential.
EDIT: If the archive published hashes of everything they added daily in the NYT (or any publication) it would become unprintably large. It would only work digitally, at which point we're back to something that's trivial to modify...
(If you could get the same hash, then even a blockchain won't give you integrity.)
If your list of hashes is huge, you can just print the hash of the entire list. There's no such thing as unprintably large for mere attestation.
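Concretely, the attestation collapses to a few lines of Python (illustrative digests only):

    import hashlib

    # However many per-item digests the day's archive produced...
    daily_hashes = [b"digest-of-item-1", b"digest-of-item-2"]

    # ...you only ever need to print this one 64-character line.
    print(hashlib.sha256(b"".join(daily_hashes)).hexdigest())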
I suspect the hardest part of doing that would be simply that you don't fit into their pre-existing categories.
FWIW, if you plan to do that, I'd suggest you put a Bitcoin block hash in the NY Times instead, which would prove the timestamps of everything that's been timestamped via Bitcoin. You can then timestamp your own stuff for free via OpenTimestamps, at which point your proof goes <your data> -> OpenTimestamps -> Bitcoin -> NY Times.
Timestamps are additive security, so it makes sense to publish them wisely. But if you're going to do that, might as well strengthen the security of as much stuff as possible in one go.
You can put any classified ad in any category you want; the newspapers don't care.
I proposed to my wife by placing an ad in the real estate ads of the Vancouver Sun because I knew she'd see it there.
A very similar example is found in git repos: while normally you'd have every single bit of data that led up to git HEAD, you can use git in "shallow" mode, which only has a subset of that data. If you delete all but the shallow checkouts, the missing data will be gone forever. The missing data is still protected from being modified by the hashing that Git does - and you're guaranteed to know that data is in fact missing - but that cryptography doesn't magically make the data actually accessible.
Kind of. The current state of the archive is mutable, but changes to that state are logged to an append-only edit history; it's that edit history that is the "blockchain", and starting from a known good state and replaying all those edits must produce the current state. In fact, this is how cryptocurrencies work too: the state is the balances/UTXO set, and the blockchain records transactions, which are effectively just mutations on that state.
In this situation, you'd look at the current state and find the deleted snapshot missing, but the edit log would have an entry saying the snapshot was added (and what its hash was at the time), then another entry saying it was deleted.
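A minimal sketch of that model (entry fields invented for illustration):

    import hashlib
    import json

    log = []  # append-only edit history; each entry commits to its predecessor

    def append(op: str, url: str, sha256: str = "") -> None:
        prev = log[-1]["this_hash"] if log else "genesis"
        body = json.dumps({"op": op, "url": url, "sha256": sha256}, sort_keys=True)
        log.append({"op": op, "url": url, "sha256": sha256,
                    "prev_hash": prev,
                    "this_hash": hashlib.sha256((prev + body).encode()).hexdigest()})

    append("add", "http://example.com/post", "ab12cd...")
    append("delete", "http://example.com/post")

    # Replaying from a known-good start must reproduce the current state;
    # the snapshot is gone from the state, but the log shows it existed.
    state = {}
    for e in log:
        if e["op"] == "add":
            state[e["url"]] = e["sha256"]
        else:
            state.pop(e["url"], None)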
I believe this would also be an issue for things like Filecoin/IPFS but I’m not sure if the liability issues are different or nuanced.
It's not as robust as a blockchain (maybe!) but it's easy and I've been doing it a good bit longer than 'blockchain' has been talked about. More importantly, I can use it to prove that I possessed certain files at certain times, historically.
> I can use it to prove that I possessed certain files at certain times, historically.
Your twitter account would no longer be anonymous at that point. What's the utility of it being anonymous now?
But you can't force a publisher to use something like that, especially if it's the publisher that wants to deny its authenticity.
I'm not sure how old the Internet Archive copies were since they're no longer available, but at least one of the ones saved elsewhere was originally archived in 2012.
Many tech savvy people I know have been similarly burned by link rot and now curate their own archives.
Giddiness maps wonderfully in that sentence.
But to stay rational: 'looking forward to' and 'excited for' are the more objective terms.
"Often the anticipation of a shot is worse than the pain of the stick."
"Often the dread of a shot is worse than the pain of the stick."
I think it's just you; see Merriam-Webster: there are 1a, 1b, 2, 3a, 3b, and 4 meanings, and yours is just "1b" as far as I can see.
They say they "declined to take down the archives", but they didn't in fact do this at all: they just insisted that a request to take down the archives come in the form of a robots.txt, and they automatically, and without review, comply with all such requests. They never in fact decline to take down any archives if the request is properly given as a robots.txt.
I don't know why they bothered making statements about "declining to take down the archives" in the first place (to the journalist or to us), or comments about "Reid's being a journalist (a very high-profile one, at that) and the journalistic nature of the blog archive"; they did not in fact "decline to take down the archives" at all. The "journalistic nature of the archive" was in fact irrelevant. They took 'em down. They are down.
If that file were to be removed, presumably the archive would again be served up upon request.
The lawyers were asking for the archive store itself to be wiped.
The Internet Archive has a mechanism for doing this, as I understand it. It involves asserting copyright over the material in question and essentially "making a case" for removal. IA decided the case they made didn't pass muster, and denied specific removal on those grounds, which is why they mention "journalistic nature of the archive" and so forth.
But that's entirely orthogonal to their policy of treating active maintenance of robots.txt as indicative of positive copyright assertion over the contents of an entire domain, which Ms. Reid's team appears to have taken as a fallback position. They couldn't get the sanitized archive they wanted, so they just made the whole thing invisible.
That seems like a pretty glaring flaw in something designed to create an enduring record.
From the FAQ: http://netarkivet.dk/in-english/faq/#anchor8
8. Do you respect robots.txt?
No, we do not. When we collect the Danish part of the internet we ignore the so-called robots.txt directives. Studies from 2003-2004 showed that many of the truly important web sites (e.g. news media, political parties) had very stringent robots.txt directives. If we follow these directives, very little or nothing at all will be archived from those websites. Therefore, ignoring robots.txt is explicitly mentioned in the commentary to the law as being necessary in order to collect all relevant material
I wonder if there are any other national archives of the internet that do the same.
https://www.bl.uk/collection-guides/uk-web-archive describes the much more limited approach taken by the British Library much later in time, but it might extend to a similar scope.
I think if I were running a national or internationally mandated archiving initiative, I would basically want to take in content from the Internet Archive and not remove things; it would probably also be less expensive that way than running my own crawler.
The key is that it works both ways. By respecting the live robots.txt, and only the live one, data hiding must be an active process requested on an ongoing basis by a live entity. As soon as the entity goes defunct, any previously scraped data is automatically republished. Thus archive.org is protected from lawsuits by any extant organisation, yet in the long run it still archives everything it reasonably can.
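A sketch of that display-time rule (a crude stand-in for WBM's actual logic; real robots.txt parsing is more careful):

    import urllib.error
    import urllib.request

    def currently_hidden(domain: str) -> bool:
        # Hide archives only while a live site actively asks for it.
        try:
            with urllib.request.urlopen(f"http://{domain}/robots.txt", timeout=10) as r:
                return "ia_archiver" in r.read().decode(errors="replace")
        except (urllib.error.URLError, OSError):
            # Domain defunct or unreachable: the request lapses,
            # and previously scraped data is republished.
            return False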
> We are now looking to do this more broadly.
That's the part I'm asking about.
Sure, some people just want to hide embarrassing or incriminating content, but there are also cases where someone is being stalked or harassed based on things they shared online, and hiding those things from Archive users may mitigate that.
I don't think it's mentioned in an official document, but it's usually referred to as "darking".
It's probably safe to assume that the same concept applies to the Wayback Machine as to the rest of IA.
Edit: Here's a page that indirectly conveys some information about it: https://archive.org/details/IA_books_QA_codes
The solution (which the Internet Archive really needs to implement) is to look at the domain registration data or something, and then only remove content if the same owner updated the robots.txt file. If not, then just disallow archiving any new content, since the new domain owner usually has no right to decide what happens to the old site content.
Anything else obviates the entire mission of the Wayback Machine.
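Something like this, with the WHOIS lookup left as a purely hypothetical stub:

    def lookup_registrant(domain: str) -> str:
        # Hypothetical stand-in for a WHOIS/registration-data lookup.
        raise NotImplementedError

    def may_retroactively_hide(domain: str, registrant_at_crawl: str) -> bool:
        # Only the owner who published the content gets to retract it;
        # a new owner can block future crawls but not erase the old site.
        return lookup_registrant(domain) == registrant_at_crawl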
She's a prominent left-wing television personality. Would you be so accommodating if it were, say, Tucker Carlson trying to scrub embarrassing information about himself from the wayback machine?
I don't know what "start from scratch" would mean – the point is that each site is sampled many times throughout history. That said, it is very odd that a current change in robots.txt would prevent looking at old samples. And that's indeed what it looks like:
> Page cannot be displayed due to robots.txt.
The robots.txt shows a positive assertion that parts of a site should be excluded from being used by automated systems.
In most cases I imagine WBM does not have permission of the owner to keep a duplicate of the site; it's certainly tortious in UK law.
Sites that don't change their robots.txt are probably highly correlated with sites that don't sue for the infringement.
""To remove your site from the Wayback Machine, place a robots.txt file at the top level of your site (e.g. www.yourdomain.com/robots.txt).
The robots.txt file will do two things:
1. It will remove documents from your domain from the Wayback Machine.
2. It will tell us not to crawl your site in the future.
To exclude the Internet Archive’s crawler (and remove documents from the Wayback Machine) while allowing all other robots to crawl your site, your robots.txt file should say:

    User-agent: ia_archiver
    Disallow: /
Their current documentation no longer says that they stop displaying old archives automatically in the presence of an ia_archiver Disallow directive, but I have not experimented to see whether they still actually do this anyway.
Just guessing wildly about similar issues, I know that news organizations which have a publicly documented "unpublish" policy tend to get that policy used aggressively by reputation management firms and the like.
I don’t think that’s it...it’s not a technical thing. Deleting all archives must be a courtesy they extend to anyone that specifically denies access to Wayback Machine in their robots.txt. Does anyone know if this is documented? If so, why didn’t her lawyers just carry out the robots.txt technique and not even bother contacting them? Most importantly, why would they have such a policy? This is all very odd.
I hang around in emulation circles and there's been some talk in the past few weeks because some Nintendo ROM archives had been taken offline from archive.org but people soon figured out that they could still access them by tinkering with the URL. The situation is a bit different here though.
Instead, you are just going to pretend that your past self never existed...
"I find gay sex to be gross" isn't that controversial of an opinion. Plenty of open-minded, accepting people agree with you. It just wasn't a worthwhile opinion to espouse...
Own it, Joy. Don't just play dumb, because now you just look dumb.
Some of them crossed the line and were no longer merely disappointing.
>claims that gay men prey on “impressionable teens”
What Joy SHOULD have done was admit that she "used" to be a homophone, and apologize.
"homophobe" - a person with an extreme and irrational aversion to homosexuality and homosexual people.
Just a little joke about a linguistic mixup.
The better narrative would be to say:
"This is who I used to be, there were many people like me at the time, but my views have evolved and I've become close friends with many of the people that my words hurt. I'm sorry and I work every day to make up for these mistakes."
Actions speak louder than words.
If you remove the robots.txt setting, the archives become available again.
VOX for example returns a 0-sized page for archive.is. In the past VICE returned 404s to archive.is https://i.imgur.com/OnFdVpS.jpg
What I mean to say is that these services are useful but they are not faultless.
In short, before you publish a blog post that is sexist/racist/homophobic/whatever, consider that even if you delete it, others may have a copy and will use it against you.
The issue I have is that we should not be able to just block access to archived content because it's embarrassing.
Sometimes I use archive.is; they don't automatically delete because of robots.txt, but it's not fully clear to me when they do delete things.
This is effectively what libraries have been doing for many years with their archives of newspapers.
You have to incentivize the people running the system and storing the hashes.
Of course there are other ways to achieve that, such as publishing your checksum to a vast number of neutral third parties via a mailing list, BitTorrent, or even a newspaper. You could also rely on a trusted third party with little incentive to manipulate the data (or a lot to lose if caught cheating), such as a bank, an insurer, or a notary.
I think archive.org could potentially do something like that by building a Merkle tree containing the checksums of the various pages they've archived during the day and publishing the top hash every day for anybody to archive (or posting it on a blockchain or whatever, as said above). If later on somebody accuses the archive of manipulation, they can publish the Merkle tree for the day the site was archived, which contains the checksum of the page, and anybody holding the top hash of that day can vouch that it is indeed correct.
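A rough sketch of that daily tree (pairwise SHA-256; real inclusion-proof code omitted):

    import hashlib

    def h(data: bytes) -> bytes:
        return hashlib.sha256(data).digest()

    def merkle_root(checksums: list) -> bytes:
        # Fold the day's page checksums up to one publishable top hash.
        level = [h(c) for c in checksums]
        while len(level) > 1:
            if len(level) % 2:
                level.append(level[-1])  # duplicate the odd node out
            level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        return level[0]

    day = [b"capture: blog.reidreport.com/post", b"capture: example.com/"]
    print(merkle_root(day).hex())  # the one hash anybody can mirror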
It doesn't stop the archive from storing bogus data but it makes it impossible to "change the past" post-facto, so in this particular situation the journalist could only complain that the archive stored bogus data back in the day and not that it was recently tampered with.
Realistically it might be overkill, though; simply setting up a mailing list where anybody can subscribe and be sent the checksum every day, or even just publishing it at some URL and letting users scrape it, might be sufficient. If we're talking about one checksum per day, it's only a few kilobytes every year; it shouldn't be too difficult to convince a few hundred people and organizations around the world to mirror it.
HTTPS does not sign the content. It MACs the content.
As an aside, I've been noticing Google has been getting worse and worse at finding something I'm sure is out there. I'm not saying that their algorithms are necessarily getting worse; maybe it's just getting harder to deal with the sheer scale of the web these days.
Clearly her domain is defunct - but I got suckered and actually came here to say things like "what terrible journalistic standards" before double checking.
As my old domains fall into disrepair I guess I will need to archive them to S3 and keep up the payments just to stop this happening.
An interesting problem - and possibly a revenue source for archive.org?
Hang on - the article on archive.org says (someone) added a robots.txt to block them. But blog.reidreport.com is parked on some crappy redirect thing.
Whois says that email@example.com still owns the domain - so I think she has got some very, very bad advice from her hosting company. And my point still stands - a domain name is a reputation, and it is for life, not just for Christmas.
I think I'd be concerned about your client redirecting you to a squatter page.
Why would the robots file on an active site be applied to the archived content?
Google’s distributing more content than archive.is.
Lawsuits are spendy. TIA have Streisanded the issue.
If I put a file named "legal.txt" in an online folder, is anyone required to read it and act upon it? It might as well be a file intended for some completely unrelated purpose; e.g. a lawyer that put some drafts online, or for all the reader knows, it might even be part of a movie script.
In most cases, copyright law requires the reader of a document not to republish it, so the robots.txt standard is actually much more permissive.
Perhaps having a bot generate a synopsis of removed content, and showing that in its place would solve any copyright issue fairly elegantly?
You learn something new every day... Multiple times a day in my case!
Who had motive to alter the posts in question? Who had the opportunity? When could it have happened? What method did they use to do so?
If Reid's team cannot plausibly answer those questions, we are still examining the simplest hypothesis, and have seen no plausible evidence that it should be refuted.
If we are to believe that those posts were written by someone else posing as Reid, would that suspicion not apply equally to everything appearing on her blog now? In which case, the solution has always been to sign the post using public-private asymmetric cryptography and to employ a public timestamp server to verify the time of publication.
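For instance, with Ed25519 via the pyca/cryptography package (a sketch of the signing half only; the timestamp-server step is left out):

    from cryptography.exceptions import InvalidSignature
    from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

    post = b"Today's post, byte for byte as published."

    private_key = Ed25519PrivateKey.generate()  # held only by the author
    public_key = private_key.public_key()       # published with the blog

    signature = private_key.sign(post)

    # Anyone can later verify the author's key signed exactly these bytes;
    # timestamping the signature would additionally pin down *when*.
    try:
        public_key.verify(signature, post)
        print("authentic")
    except InvalidSignature:
        print("not signed by this key")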
Yes, but it seemed like they had changed their mind, exactly because there is a huge issue with "expired" domains, see:
They experimentally ignored robots.txt on .mil and .gov domains, and I thought they were going to extend this new policy for all archived sites.
The situation/status is not clear, though the retroactive validity of robots.txt remains (at least to me) absurd.
It is IMHO only fair to respect a robots.txt from the date it was put online; it is the retroactivity that is perplexing. As a matter of fact, I see it as violating the decisions of the Author, who, at the time the content was made available, expressed the intention to have it archived and accessible by not posting a robots.txt, while there is no guarantee whatsoever that a robots.txt posted years later is still an expression of the same Author.
Most probably a middle way would be - if technically possible - that the robots.txt is respected only for the period in which the site has the same owner/registrar, but for the large number of sites with anonymous or "by proxy" ownership that could not possibly work.
In pretty much all other cases--except where they were public domain or CC0--it's probably not strictly legal to archive them at all. Therefore, it makes sense to bend over backwards to remove any material if asked to, programmatically or otherwise.
>I see it as violating the decisions of the Author
Maybe in some cases. But, for better or worse, preventing crawling is opt-in rather than opt-out, and defaults are very powerful. "You didn't explicitly tell me that you didn't want me to repurpose your copyrighted material" isn't a very strong legal argument.
This would also solve situations where a new owner blocks robot access for a domain where the former owner is OK with the existence of the archived site.
It seems to make the most sense to only have a robots.txt affect pages archived when that specific version of robots.txt is in effect.
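A sketch with the standard-library parser, assuming each capture stored the robots.txt it saw at crawl time (field names invented):

    import urllib.robotparser

    def snapshot_visible(snapshot_url: str, robots_txt_at_crawl: str) -> bool:
        # Judge a capture only by the robots.txt in force when it was taken.
        rp = urllib.robotparser.RobotFileParser()
        rp.parse(robots_txt_at_crawl.splitlines())
        return rp.can_fetch("ia_archiver", snapshot_url)

    # A later, stricter robots.txt never enters this check, so it can stop
    # new crawls without retroactively hiding old snapshots.
    print(snapshot_visible("http://example.com/post", "User-agent: *\nAllow: /"))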
At the time of crawling the robots.txt is parsed anyway.
If it excludes part or all of the site from crawling, it should IMHO be respected, and the crawl of that day should be stopped (and pages NOT even archived); if it doesn't, then crawling and archiving them is "fair".
The point here is that by adding a "new" robots.txt, the "previously archived" pages (which remain archived) are no longer displayed by the Wayback Machine.
It is only a political/legal (and unilateral) decision by the good people at archive.org; it could be changed at any time, at their discretion, without the need for any "new" robots.txt syntax.
But that is easily achieved by politely asking the good people at archive.org; they won't normally decline a "reasonable" request to suppress access to this or that page.
As a side note, there is something that (when it comes to the internet) really escapes me. In the "real" world, before everything was digital, you had lawful means to get a retraction in case - say - of libel, but you weren't allowed to retroactively change history by destroying all written records, and attempts like burning books in public squares weren't much appreciated. I don't really see how going digital should be so much different.
I guess that the meaning of "publish" in the sense of "making public by printing it" has been altered by the common presence of the "undo" button.
Another sign I am getting old (and grumpy), I know.
What would be interesting would just be storing a Merkle tree for archive hashes so many parties could verify that a much smaller number of copies haven’t been modified.
A few months ago we stopped referring to robots.txt files on U.S. government and military web sites for both crawling and displaying web pages (though we respond to removal requests sent to firstname.lastname@example.org). As we have moved towards broader access it has not caused problems, which we take as a good sign. We are now looking to do this more broadly.
We see the future of web archiving relying less on robots.txt file declarations geared toward search engines, and more on representing the web as it really was, and is, from a user’s perspective.
The author will then be unable to repudiate (or, conversely, able to confirm) the authorship of the relevant text.
If I have my own blog on my own domain, and Google and Wayback Machine archives it, can I request them to delete it one year later under GDPR?
You can request that they delete any personal/identifiable data: references to names, email addresses, pictures of you, etc., but probably not content.
Almost all my blogging is about tech, not PII at all.
OK, the way I read it, the author--one way or another--asked for her blog not to be hosted by the Wayback Machine, and they declined. It's my work; as long as I can verify that I wrote it, they should take it down or be sued for copyright infringement.
I get the "we're archiving the internet" thing, but if I want that post where I said Google is evil taken down because I have a G job interview a week from now, they should take it down. Another thing: just because I have a page online doesn't mean that I gave them consent to archive it for eternity.
I get the robots.txt, but if you're archiving you should ask for permission; there are a gazillion robots out there.
Imagine an author giving you a copy of the book and then 15 years later coming to your home library and asking for it back.
It’s cool to not post publicly or to restrict access. But releasing and then yanking doesn’t make sense.