
Wiley uses fake DOIs on its website as a crawl trap - okket
https://docs.google.com/document/d/1uTVHPI8r4VO31KihsyiBHsh_gp8jZ38fMvP5nP5XOkw/
======
ChuckMcM
This certainly points out one of the many difficulties of crawling the web.
Wiley has set up its robots.txt correctly, but someone else duplicates their
content, which contains only links.

The author was crawling[1] and as such should have checked Wiley's robots.txt
file[2], which disallows it. So the "academic mining project" should have
contacted Wiley directly and said, "Hey, we're linking together a bunch of
DOIs, and we would like to automatically fetch them from your site." And Wiley
would have said "Sorry, nope," or, if they weren't complete idiots, they might
have said, "Ok, use this user agent, XXX, and at most this crawl rate, XXX.
Respect the robots.txt file, and when the user agent is disallowed again, you
have to stop crawling."

Since they didn't do that, they violated Wiley's terms of service, and Wiley
traced the IP back to the university and called them to find out what was up
with that. (Part of their "Aaron defense," no doubt.)

[1] "I attempted to use these DOIs for a legitimate academic mining project in
good faith (one which I had pre-informed the library about)."

[2] http://onlinelibrary.wiley.com/robots.txt

~~~
glimmung
If I read this right, surely the point is that he was not crawling the /Wiley/
site, but rather something else that contained those links? This is the thing
about deliberate contamination - it spreads.

~~~
ChuckMcM
My reading of the article has him accessing the Wiley site using a crawler. He
may have sourced the fake DOIs from some other seed document, but he didn't
actually try to fetch them with a browser.

He states that he accessed the site from the library so that he had access to
Wiley online (as part of the Library's access arrangement). And he states that
he was building tools for automatically linking documents. (tool, not
browser).

