
Addressing Recent Claims of “Manipulated” Blog Posts in the Wayback Machine - edward
https://blog.archive.org/2018/04/24/addressing-recent-claims-of-manipulated-blog-posts-in-the-wayback-machine/
======
nkurz
They aren't very prominent, but the two links at the top of the blog post
provide useful context:

[https://www.mediaite.com/online/exclusive-joy-reid-claims-
ne...](https://www.mediaite.com/online/exclusive-joy-reid-claims-newly-
discovered-homophobic-posts-from-her-blog-were-fabricated/)

[https://theintercept.com/2018/04/24/msnbcs-joy-reid-
claims-h...](https://theintercept.com/2018/04/24/msnbcs-joy-reid-claims-her-
website-was-hacked-and-bigoted-anti-lgbt-content-added-a-bizarre-story-
liberal-outlets-ignore/)

The summary is that last year Reid (a journalist for a major US television
news network) publicly apologized for a series of blog posts that were
characterized as 'homophobic'. More posts of a similar nature were recently
discovered on archive.org, and instead of apologizing for these as well, Reid
has disavowed them. She and her lawyers claim that unlike the previous
occasion, these newly discovered posts were altered by 'hackers' either before
or after being archived. The linked blog post is making the limited claim that the
posts on archive.org accurately represent the posts present on Reid's site at
the time they were archived, and do not appear to have been altered post-
archiving.

~~~
onion2k
_The linked blog post is making the limited claim the posts on archive.org
accurately represent the posts present on Reid's site at the time they were
archived, and do not appear to have been altered post-archiving._

This might actually be a good use case for a blockchain. Hashing the data
that's added to the archive and then putting the hash in the blockchain would
reasonably prove the data in the archive hasn't been modified at a later date.
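
A rough sketch of what that could look like, using a plain hash chain in
Python rather than any real blockchain. The record fields, the genesis value,
and the example URLs are all made up for illustration; nothing here reflects
how archive.org actually stores snapshots.

```python
import hashlib
import json


def entry_hash(prev_hash, url, timestamp, content):
    """Hash one snapshot record together with the previous entry's hash."""
    record = json.dumps(
        {
            "prev": prev_hash,
            "url": url,
            "ts": timestamp,
            "sha256": hashlib.sha256(content).hexdigest(),
        },
        sort_keys=True,
    )
    return hashlib.sha256(record.encode("utf-8")).hexdigest()


def build_chain(snapshots):
    """Return the chained hashes for a list of (url, timestamp, content)."""
    chain = []
    prev = "0" * 64  # arbitrary genesis value (assumption)
    for url, ts, content in snapshots:
        prev = entry_hash(prev, url, ts, content)
        chain.append(prev)
    return chain


# Hypothetical snapshots, in the order they were archived.
snapshots = [
    ("http://example.com/a", "2006-01-01T00:00:00Z", b"<html>first</html>"),
    ("http://example.com/b", "2006-01-02T00:00:00Z", b"<html>second</html>"),
]
original = build_chain(snapshots)

# Editing an old snapshot changes its hash and every hash after it.
tampered = list(snapshots)
tampered[0] = (snapshots[0][0], snapshots[0][1], b"<html>edited</html>")
altered = build_chain(tampered)

print(original[-1] != altered[-1])
```

If the latest chain hash is published somewhere the archive can't rewrite,
anyone can recompute the chain and notice a later edit. It says nothing about
whether the page was tampered with before it was archived.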

~~~
etskinner
Forgive my shallow understanding of blockchain, but wouldn't that make the
archive immutable? Surely there are times where the Wayback Machine needs to
delete snapshots, in cases where there's copyright infringement or other
illegal activity.

~~~
beager
This is also an issue for major blockchains in deployment now, specifically
Bitcoin. There is the potential for illegal content, or links to it, to be
stacked on BTC’s blockchain [0], and so anyone who holds that blockchain would
also possess it.

I believe this would also be an issue for things like Filecoin/IPFS but I’m
not sure if the liability issues are different or nuanced.

[0]
[https://www.theregister.co.uk/2018/03/19/ability_to_dump_ill...](https://www.theregister.co.uk/2018/03/19/ability_to_dump_illegal_content_in_bitcoins_blockchain_puts_participants_in_peril/)

~~~
AgentME
IPFS works like torrents: users only host things that they choose to, so
there's no issue of some people being stuck hosting content they don't want
to.

------
igorkraw
For what it's worth, I have habitually been saving things with
[https://addons.mozilla.org/en-us/firefox/addon/save-page-
we/...](https://addons.mozilla.org/en-us/firefox/addon/save-page-
we/reviews/847893/) since last year. Though I have some other motivations,
having a git repo with auto snapshots every minute and continuously
snapshotting HN discussions and changing news sites has been giving me a "I'm
gonna analyze the _shit_ out of this" vorfreude for the past 2 months (don't
know the English word, German for literally "pre-happiness")

~~~
y4mi
as a fellow german:

giddiness maps wonderfully in that sentence.

but to stay rational: 'looking forward to', 'excited for' are the more
objective terms.

~~~
igorkraw
Dankesehr

~~~
IncRnd
horripilation

------
myhrvold
The way I read it... the Wayback Machine did not allow archives to be taken
down arbitrarily -- but a subsequent targeted robots.txt exclusion of the
Wayback Machine could render prior archives of that website moot? (Because the
Wayback Machine starts from scratch each time?)

~~~
ry_ry
If that's the case you could buy up defunct domains, exclude everything via
robots.txt and selectively purge sites from archive.org

That seems like a pretty glaring flaw in something designed to create an
enduring record.

~~~
proaralyst
I don't think they purge the archives, I think they just don't serve them on
the wayback machine.

~~~
pronoiac
Yes. Instead of deleting anything, I think the Archive tends to mark stuff as
"do not show this for a few decades."

~~~
jake-low
Do you have a source for this? I didn’t know that but it’s very interesting –
a good compromise between the interests of current website owners[1] and
future historians.

[1]: Sure, some people just want to hide embarrassing or incriminating
content, but there’s also cases where someone is being stalked or harassed
based on things they shared online, and hiding those things from Archive users
may mitigate that.

~~~
aepiepaey
Generally when items are "taken down" from the Internet Archive, they just
stop being published, and are not deleted.

I don't think it's mentioned in an official document, but it's usually
referred to as "darking".

It's probably safe to assume that the same concept applies to the Wayback
Machine as to the rest of IA.

Edit: Here's a page that indirectly conveys some information about it:
[https://archive.org/details/IA_books_QA_codes](https://archive.org/details/IA_books_QA_codes)

------
RIMR
Come on, Joy. Your blog posts weren't even that bad, they were just in poor
taste. You didn't say anything particularly vitriolic or hateful. This is your
opportunity to admit that these were once your views, and emphasize your
personal growth since then.

Instead, you are just going to pretend that your past self never existed...

"I find gay sex to be gross" isn't that controversial of an opinion. Plenty of
open-minded, accepting people agree with you. It just wasn't a worthwhile
opinion to espouse...

Own it, Joy. Don't just play dumb, because now you just look dumb.

~~~
magic_beans
You're not wrong. Her blog is tame compared to some of the comments I see
DAILY in ANY subreddit on reddit.

What Joy SHOULD have done was admit that she "used" to be a homophone, and
apologize.

~~~
apetresc
She's still a homophone; Reid sounds exactly like "read".

~~~
accoil
Isn't that the point though? She wasn't a homophone in the past.

~~~
pchristensen
"homophone" \- each of two or more words having the same pronunciation but
different meanings, origins, or spelling, e.g., new and knew.

"homophobe" \- a person with an extreme and irrational aversion to
homosexuality and homosexual people.

Just a little joke about a linguistic mixup.

~~~
accoil
:) Might just be my pronunciation, but Reid only sounds like the present form.

------
parliament32
Wait so adding a robots.txt exclusion for the Wayback Machine makes all
previous archives of the site inaccessible? That's very odd behaviour, and
really not the point of a robots.txt file... I would expect a robots.txt to
control a bot's visits / scraping behaviour, not a site's history.

~~~
adventured
Yes. They use the robots.txt file to essentially ascertain ownership or
control of the domain. Archive.org doesn't want to delete the content they
have (for whatever reason), so the compromise they came up with is to read the
robots.txt and then hide the content they have archived if the present domain
owner/controller wants it to be that way.

If you remove the robots.txt setting, the archives become available again.
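
For reference, a minimal sketch of how a crawler would evaluate such an
exclusion, using Python's standard `urllib.robotparser`. The `ia_archiver`
user-agent token is an assumption (it is the name commonly cited for the
Wayback Machine's crawler), and the robots.txt contents and URL are invented.
This only shows the crawl-permission check, not how the Archive decides what
to display.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt, user_agent, url):
    """Return True if `user_agent` may fetch `url` under this robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt that singles out one crawler and allows the rest.
ROBOTS = """\
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow:
"""

url = "http://example.com/2006/01/some-post.html"
print(allowed(ROBOTS, "ia_archiver", url))  # False
print(allowed(ROBOTS, "Googlebot", url))    # True
```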

------
rmason
So if you find something you'd better make a copy of it yourself because it
might be going dark. Doesn't that kind of defeat the whole purpose of the
Wayback Machine?

~~~
lzy
So I try to make a copy of any interesting web pages on archive.is these days.

[http://archive.is/faq#Why_does_archive_is_not_obey_robots_tx...](http://archive.is/faq#Why_does_archive_is_not_obey_robots_txt_)

~~~
nsbq71
This one is funny, because conservatives have used archive.is for some time to
archive and mock left-leaning websites and some of them blocked archive.is in
the past and still block archive.is today.

VOX for example returns a 0-sized page for archive.is. In the past VICE
returned 404s to archive.is
[https://i.imgur.com/OnFdVpS.jpg](https://i.imgur.com/OnFdVpS.jpg)

What I mean to say is that these services are useful but they are not
faultless.

~~~
SmellyGeekBoy
Why are so many irrelevant political left-vs-right "he said, she said" type
comments popping up on HN just lately?

~~~
fapjacks
I personally think this is simply a result of how much harder the media is
pushing that divide (for all of their various purposes). I actually spent some
time last year researching this, because I thought I might have just become an
old man thinking how great things used to be. I started reading old news
stories fairly randomly, from the present time all the way back to the Vietnam
era (and a few rabbit holes to earlier times). The first thing that surprised
me was the amount of link rot that exists. I always knew intellectually that
it was a problem, but wow. It's bad. The second thing that I found was that
indeed, the media hammers on the "us-versus-them" political divide of American
politics much, much harder nowadays than even ten years ago. I think Fox News
was really the turning point. It opened the flood gates. I always remember
thinking how "extreme" Fox News was, but I challenge anyone to look up a few
of their older stories from the middle of the last decade. It's child's play
compared to what pretty much every media outlet is doing today. You can hardly
read a recent news story from just about _anywhere_ without being told how
it's supposed to fit into our political worldview, and how we should feel
about it, and why it's good/bad/stupid/amazing/"terrifying". And so of course,
because of this, people are just responding to the programming. Creating the
world they're led to believe they live in. I think it really is that
straightforward.

~~~
deciplex
Did you just look at print? Talk radio has been hammering this since the late
eighties. Hell you can probably draw a line straight from the "Moral Majority"
shit in the seventies, to where we find ourselves now. I suspect this has
always been a big part of American culture, but it's being magnified now
either by new tech or malicious actors or both.

~~~
fapjacks
Oh, you know, that's interesting. I hadn't even thought about talk radio, but
you're absolutely right.

------
lifeisstillgood
This highlights to me something about long-term management of my domains.
"blog.reidreport.com" is now run by some domain squatter - and knowing nothing
about Reid or reid report I took a vaguely generic website at face value - and
clicked on the heavily disguised paid adverts.

Clearly her domain is defunct - but I got suckered and actually came here to
say things like "what terrible journalistic standards" before double checking.

As my old domains fall into disrepair I guess I will need to archive them to
S3 and keep up the payments just to stop this happening.

An interesting problem - and possibly a revenue source for archive.org?

EDIT: Hang on - the article on archive says (someone) added a robots.txt to
block them. But the blog.reidreport.com is parked on some crappy redirect
thing.

Whois says that joyannreird@gmail.com still owns the domain - so I think she
has got some very very bad advice from her hosting company. And my point still
stands - a domain name is a reputation, and it is for life, not just for
christmas.

~~~
heartbreak
Are you sure you're spelling the URL correctly? `blog.reidreport.com` (as you
spelled it in your post) redirects to a Blogger.com "Permission denied" page.
Not a squatter.

I think I'd be concerned about your client redirecting you to a squatter page.

------
ry_ry
That was... vague

Why would the robots file on an active site be applied to the archived
content?

~~~
zxcmx
Legal issues. Who owns the content? There is no real legal basis for a right
to mirror, despite how it feels from a techy point of view.

~~~
ashelmire
Fair use. It’s for documentation, educational, and research purposes.

~~~
JeremyBanks
There are limits on the amount of content you can redistribute under "fair
use" for a given purpose. I'm not sure if redistributing half of the internet
would be legally justifiable.

~~~
prepend
It’s certainly a good Supreme Court case. EFF/ACLU should be able to cover
this case.

Google’s distributing more content than archive.is.

------
logfromblammo
Applying the various razors, I find that the hypothesis that would need to be
refuted first is that Reid wrote the posts, they were archived as written,
unaltered since then, and she simply does not wish to take responsibility for
their contents now.

Who had motive to alter the posts in question? Who had the opportunity? When
could it have happened? What method did they use to do so?

If Reid's team cannot plausibly answer those questions, we are still examining
the simplest hypothesis, and have seen no plausible evidence that it should be
refuted.

If we are to believe that those posts were written by someone else posing as
Reid, would that suspicion not apply equally to everything appearing on her
blog now? In which case, the solution has always been to sign the post using
public-private asymmetric cryptography and to employ a public timestamp server
to verify the time of publication.

------
aviv
The robots.txt exclusion loophole has been known for quite a long time.

~~~
jaclaz
>The robots.txt exclusion loophole has been known for quite a long time.

Yes, but it seemed like they had changed their mind, exactly because there is
a huge issue with "expired" domains, see:

[https://blog.archive.org/2016/12/17/robots-txt-gov-mil-
websi...](https://blog.archive.org/2016/12/17/robots-txt-gov-mil-websites/)

[https://blog.archive.org/2017/04/17/robots-txt-meant-for-
sea...](https://blog.archive.org/2017/04/17/robots-txt-meant-for-search-
engines-dont-work-well-for-web-archives/)

They experimentally ignored robots.txt on .mil and .gov domains, and I thought
they were going to extend this new policy for all archived sites.

The situation/status is not clear, though the retroactive validity of
robots.txt remains (at least to me) absurd.

It is IMHO only fair to respect a robots.txt since the date it has been put
online, it is the retroactivity that is perplexing, as a matter of fact I see
it as violating the decisions of the Author, who - at the time some content
was made available - by not posting a robots.txt expressed the intention to
have the contents archived and accessible, while there is no guarantee
whatsoever that the robots.txt posted years later is still an expression of
the same Author.

Most probably a middle way would be - if possible technically - that the
robots.txt is respected only for the period in which the site has the same
owner/registrar, but for the large amount of sites with anonymous or "by
proxy" ownership that could not possibly work.

~~~
gmueckl
Is it reasonably feasible to extend the syntax of robots.txt to include date
ranges when the entries are specific to the IA bot? That way, specific content
from a certain time span could be retroactively suppressed if desired.

This would also solve situations where a new owner blocks robot access for a
domain where the former owner is OK with the existence of the archived site.

~~~
rocqua
Why allow retroactive suppression at all?

It seems to make the most sense to only have a robots.txt affect pages
archived when that specific version of robots.txt is in effect.
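
A small sketch of that rule in Python, assuming the archive kept a dated
history of each robots.txt version it saw. The history format, the dates, and
the function names are hypothetical; this is not how the Wayback Machine is
known to work.

```python
from bisect import bisect_right
from datetime import date

# Hypothetical history, sorted by date:
# (day this robots.txt version went live, does it block the archive bot?)
ROBOTS_HISTORY = [
    (date(2005, 1, 1), False),
    (date(2017, 12, 1), True),
]


def blocked_at(history, captured):
    """Was archiving blocked by the robots.txt in effect on `captured`?"""
    starts = [start for start, _ in history]
    idx = bisect_right(starts, captured) - 1
    if idx < 0:
        return False  # no robots.txt on record yet: treat as not blocked
    return history[idx][1]


def visible(history, captured):
    """A snapshot stays visible unless it was captured under a block."""
    return not blocked_at(history, captured)


print(visible(ROBOTS_HISTORY, date(2006, 3, 1)))  # True
print(visible(ROBOTS_HISTORY, date(2018, 2, 1)))  # False
```

Under this rule a robots.txt added in 2017 would leave the 2006 snapshots
alone, whoever owns the domain by then.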

~~~
cheschire
Perhaps this article[0] may provide you with insight into the motivations of
those who may prefer to suppress historical data.

0:
[https://en.wikipedia.org/wiki/Right_to_be_forgotten](https://en.wikipedia.org/wiki/Right_to_be_forgotten)

------
ashelmire
So who has the old version of the blog posts in question, so we can see what
this journo has to hide?

~~~
disk0
This is a good blog post to link to those that actually believe Joy Reid in
this case

[http://ws-dl.blogspot.co.uk/2018/04/2018-04-24-why-we-need-m...](http://ws-
dl.blogspot.co.uk/2018/04/2018-04-24-why-we-need-multiple-web.html)

------
zeveb
This, right here, is why the GDPR's 'right to be forgotten' is so pernicious.
Were Mrs. Reid an EU person, the Internet Archive could be forced to disappear
her previous posts. Or given that she's _also_ a public figure, would it be
permitted to retain them? Only a court could decide.

------
JustSomeNobody
I don't see this ending well for her. She should have just come clean and
apologized ... again.

------
exolymph
IPFS wayback machine when

~~~
acdha
When it scales multiple orders of magnitude better? Web archives are massive
and have non-trivial storage costs.

What would be interesting would just be storing a Merkle tree for archive
hashes so many parties could verify that a much smaller number of copies
haven’t been modified.
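
Roughly what I mean, as a Python sketch. The leaf/node prefixes and the
duplicate-last-node rule for odd levels are just one common convention picked
for illustration, and the snapshot bytes are made up.

```python
import hashlib


def sha256(data):
    return hashlib.sha256(data).digest()


def merkle_root(leaves):
    """Compute a Merkle root over a non-empty list of byte strings."""
    if not leaves:
        raise ValueError("need at least one leaf")
    # Prefix leaves and inner nodes differently so one can't pose as the other.
    level = [sha256(b"\x00" + leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # duplicate the last node on odd levels
        level = [
            sha256(b"\x01" + level[i] + level[i + 1])
            for i in range(0, len(level), 2)
        ]
    return level[0]


# Hypothetical archived snapshots.
snapshots = [b"snapshot-1", b"snapshot-2", b"snapshot-3"]
root = merkle_root(snapshots)

# Changing any one snapshot changes the root.
modified = [b"snapshot-1", b"snapshot-2 (edited)", b"snapshot-3"]
print(root.hex() != merkle_root(modified).hex())
```

Third parties would only need to keep the 32-byte root (per day, per crawl,
whatever) to later check that a copy still matches.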

------
BeetleB
A bit surprised at the comments here. As far as I can remember, robots.txt has
always been used by the Wayback Machine this way. Often a bummer when the
original domain expires and a domain squatter takes over - many have
robots.txt

~~~
dredmorbius
That's changed.

 _A few months ago we stopped referring to robots.txt files on U.S. government
and military web sites for both crawling and displaying web pages (though we
respond to removal requests sent to info@archive.org). As we have moved
towards broader access it has not caused problems, which we take as a good
sign. We are now looking to do this more broadly._

 _We see the future of web archiving relying less on robots.txt file
declarations geared toward search engines, and more on representing the web as
it really was, and is, from a user’s perspective._

[https://teleread.org/2017/04/24/the-internet-archive-will-
so...](https://teleread.org/2017/04/24/the-internet-archive-will-soon-stop-
honoring-robots-txt-files/)

[https://blog.archive.org/2017/04/17/robots-txt-meant-for-
sea...](https://blog.archive.org/2017/04/17/robots-txt-meant-for-search-
engines-dont-work-well-for-web-archives/)

------
themihai
Bottom line: You can remove the content you own using robots.txt

------
csomar
Signed hashes with blockchain time stamping could be a solution to that. A
standard format for content and publishing content (whether a blogpost or a
tweet), then based on that block of text you provide a digital signature
linked to your identity. The signature is published on the blockchain, though,
this is optional.

The author will then be unable to refute (or able to) the authorship of the
relevant text.

------
21
What's the GDPR impact on this?

If I have my own blog on my own domain, and Google and Wayback Machine
archives it, can I request them to delete it one year later under GDPR?

~~~
mercer
Wayback Machine does respect robots.txt, I think. I'm curious what happens if
you lose control of your domain, though.

~~~
kevingadd
If you lose control the whole history of the site gets nuked by a new
robots.txt. This has happened in a few notable cases fairly recently.

------
wpdev_63
I believe it.

------
onetimemanytime
_> >...we declined to take down the archives._

OK, the way I read it, the author--one way or another--asked for _her_ blog not
to be hosted by the Wayback Machine and they declined. It's my work; as long as
I can verify that I wrote it, they should take it down or be sued for copyright
infringement.

I get the "we're archiving the internet," but if I want that post where I said
Google is evil taken down because I have a G job interview a week from now,
they should take it down. Another thing, just because I have a page online,
doesn't mean that I gave them consent to archive it for eternity.

I get the robots.txt, but if you're archiving you should ask for permission,
there are a gazillion robots out there.

~~~
IAmEveryone
Section 108 exception to copyright protection for publicly-accessible
archives:
[https://www.law.cornell.edu/uscode/text/17/108](https://www.law.cornell.edu/uscode/text/17/108)

~~~
ghaff
That exception is for physical works within a library. It's basically the only
thing that makes libraries/archives special relative to you or I with respect
to copyright.

