
American Chemical Society bans university after "spider-trap" is clicked - danso
http://blogs.ch.cam.ac.uk/pmr/2014/04/02/acsgate-pandora-opens-the-american-chemical-societys-box-and-her-university-gets-cut-off/
======
freshyill
It's worth noting that many journals don't control the platform that hosts their
scholarly content. It looks like ACS uses
[Atypon](http://www.atypon.com). That's the likely
source of this spider trap, not ACS.

Atypon has [a relatively small client list](http://www.atypon.com/our-clients/featured-clients.php). Compare it to
[Highwire](http://highwire.stanford.edu/lists/allsites.dtl).
I'd be willing to bet that all journals hosted with Atypon share this spider
trap—even journals that are supposed to be open access where spidering
_should_ be OK.

Scientific publishing is weird. Source: I work in scientific publishing.

~~~
greglindahl
In Highwire's case, they typically have a robots.txt blocking everyone but
Google ... and the reason is not malice, it's inefficient software. Fetching a
page once every few seconds is enough to overload their system.
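
The pattern described (allow Google, block everyone else) looks roughly like this in a robots.txt. This is a sketch of the pattern, not Highwire's actual file:

```
# Googlebot may crawl everything (an empty Disallow allows all paths)
User-agent: Googlebot
Disallow:

# every other crawler is blocked from the whole site
User-agent: *
Disallow: /
```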

~~~
AYBABTME
A problem with robots is that they will scan all of a website. Normal user
traffic is somewhat focused and is easy to cache. When one robot comes and
scans everything at once, it brings a bunch of unpopular pages in the cache,
possibly evicting more popular ones in the process. The more popular one will
then need to be re-cached as requests come back for it.

If you avoid caching requests by robots, then instead you end up having to go
through all the layers of your app, possibly going in the database.

In most situations, I don't think the above matters much. But I can see how it
could be a worst case for some stacks.
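
The eviction effect described above can be shown with a toy LRU cache in Python. This is illustrative only; a real stack would sit behind something like Varnish or memcached:

```python
from collections import OrderedDict

# Minimal LRU page cache: a crawler sweep pushes out the hot entries.
class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key in self.data:
            self.data.move_to_end(key)     # mark as recently used
            return self.data[key]
        return None

    def put(self, key, value):
        self.data[key] = value
        self.data.move_to_end(key)
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict least recently used

cache = LRUCache(capacity=3)
# Normal traffic keeps a few popular pages hot.
for page in ["/home", "/popular-paper", "/home"]:
    cache.put(page, "rendered " + page)

# A robot sweeps through many unpopular pages at once...
for i in range(5):
    cache.put("/archive/%d" % i, "rendered archive page")

# ...and the popular pages have been evicted, so they must be re-rendered
# (hitting the app and database again) the next time a human asks for them.
assert cache.get("/home") is None
assert cache.get("/popular-paper") is None
```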

~~~
batbomb
That's easily detectable and mitigable via an MRU cache, or no caching at all.

------
s_q_b
Is it bad that I'm just as insulted by the so-called "spider trap"? It's so
technologically simple as to be useless against anyone who could deploy a web
scraper in the first place.

I mean, it's marked by comment tags that say "spider trap" right on them! It's
the worst type of disambiguation system: likely to generate false positives,
unlikely to catch real violators.

~~~
PavlovsCat
Yet the off-the-shelf bots that are just let loose on the web in general will
likely fall for it, as long as the "spider trap" is not off-the-shelf itself;
and the ones actually targeted at just you, you likely can't defeat anyway.

------
Kliment
Note how this means that anyone who is tricked into clicking that link has
just blacked out their entire institution. This has massive potential for
abuse.

~~~
naich
They have posted a reply here:
[http://www.asihcopeiaonline.org/doi/pdf/10.1046/9999-9999.99...](http://www.asihcopeiaonline.org/doi/pdf/10.1046/9999-9999.99999)

~~~
nitrogen
That looks like the spider trap link.

~~~
naich
That's the joke.

~~~
nitrogen
This sort of Slashdot-esque misdirection is not exactly appropriate on HN. It
starts with black-holing universities with a link mislabeled as a response,
but could quickly devolve into misdirections to other black holes of Slashdot
that (I assume) we really don't want here.

~~~
sebcat
As a user of HN, don't tell me or others what's appropriate, thanks.

~~~
dang
Nitrogen is right. This sort of Slashdot-esque misdirection is not exactly
appropriate on HN.

------
dalke
arXiv.org, back when it was still xxx.lanl.gov, had a similar trap. Yes, I
clicked on it. It gave a warning of the sort "don't do this again; here's
what's happening; if we see many more requests from your site, we'll shut off
access."

This was in the late 1990s.

~~~
HCIdivision17
I still remember that page. As a middle schooler who didn't know anything from
anything, it was a perplexing thing. The site's got an xxx at the front, but
looks like a legit government site from wait, Los Alamos? Like from "Surely
You're Joking Mr. Feynman"? Oh jeez, I'm gonna get in trouble with the
school...

------
PaulHoule
Funny, we used to do this when I was working at arXiv.org. We had incessant
problems with robots that didn't obey robots.txt so we needed spider traps to
keep the site from going down.

------
SixSigma
That's some level of incompetence on the trappers' part. A half-arsed solution
because they couldn't think of a better one. Off the top of my head, a
registration system with abstracts and unlock-this-article links would be a
better one.

~~~
freshyill
I'm willing to bet that they provide site licenses, where everyone in an
entire university's subnet range might have access. In an open access journal,
it _shouldn't_ matter, but many journals are hosted on the same few
platforms, and the spider trap is a feature of the platform.

------
danso
Reporting the content since site is down:

Tl;dr: researcher is browsing source code of a research paper's web page and
finds a strange link (but same domain). She clicks and is informed that her IP
is banned for automated spidering.

Apparently, this research site is meant to be open-access...

\-------

Pandora is a researcher (won’t say where, won’t say when). I don’t know her
field – she may be a scientist or a librarian. She has been scanning the
spreadsheet of the Open Access publications paid for by Wellcome Trust. It’s
got 2200 papers that Wellcome has paid 3 million GBP for, for the sole purpose
of making them available to everyone in the world. She found a paper in the
journal Biochemistry (that’s an American Chemical Society publication) and
looked at
[http://pubs.acs.org/doi/abs/10.1021/bi300674e](http://pubs.acs.org/doi/abs/10.1021/bi300674e)
. She got that OK – looked to see if she could get the PDF -
[http://pubs.acs.org/doi/pdf/10.1021/bi300674e](http://pubs.acs.org/doi/pdf/10.1021/bi300674e)
\- yes that worked OK.

What else can we download? After all, this is Open Access, isn't it? And
Wellcome have paid 666 GBP for this “hybrid” version (i.e. they get
subscription income as well). So we aren't going to break any laws…

The text contains various other links and our researcher follows some of them.
Remember she’s a scientist and scientists are curious. It’s their job. She
finds: <span id="hide"><a href="/doi/pdf/10.1046/9999-9999.99999"> <!-- Spider
trap link --></a></span> Since it's a bioscience paper she assumes it's about
spiders and how to trap them.

She clicks it. Pandora opens the box... Wham!

The whole university got cut off immediately from the whole of ACS
publications. "Thank you", ACS.

The ACS is stopping people spidering their site. EVEN FOR OPEN ACCESS. It
wasn't a biological spider. It was a web trap based on the assumption that
readers are, in some way, basically evil. Now _I_ have seen this message
before. About 7 years ago one of my graduate students was browsing 20
publications from ACS to create a vocabulary. Suddenly we were cut off with
this awful message. Dead. The whole of Cambridge University. I felt really
awful.

It felt as if I had committed a crime. And we hadn't done anything wrong. Nor
has my correspondent. If you create Open Access publications you expect - even
hope - that people will dig into them. So, ACS, remove your spider traps. We
really are in Orwellian territory where the point of Publishers is to stop
people reading science.

I think we are close to the tipping point where publishers have no value
except to their shareholders and a sick, broken, vision of what academia is
about.

UPDATE: See comment from Ross Mounce: The society (closed access) journal
‘Copeia’ also has these spider trap links in its HTML, e.g. on this contents
page: [http://www.asihcopeiaonline.org/toc/cope/2013/4](http://www.asihcopeiaonline.org/toc/cope/2013/4)

you can find

<span id="hide"><a href="/doi/pdf/10.1046/9999-9999.99999"> <!-- Spider trap
link --></a></span>

I may have accidentally cut off access for everyone at the Natural History
Museum, London once when I innocently tried this link, out of curiosity. Why do
publishers ‘booby-trap’ their websites? Don’t they know us researchers are an
inquisitive bunch? I’d be very interested to read a PDF that has a
9999-9999.9999 DOI string, if only to see what it contained – they can’t
rationally justify cutting off access to everyone just because ONE person
clicked an interesting link. PMR: Note - it's the SAME link as the ACS uses.
So I surmise that both societies outsource their web pages to some third-party
hackshop. Maybe 10.1046 is a universal anti-publisher.

PMR: It's incredibly irresponsible to leave spider traps in HTML. It's a human
reaction to explore.

------
gmisra
It seems like an easy way for a university-based "conscientious objector" to
have this issue addressed would be to intentionally click on the spider trap
link once a day.

~~~
logfromblammo
Too much work. Wget with cron. Then you can click every day without having to
click every day.

------
specialp
I work for a (non-profit) journal publisher, and we do indeed cut off robot
downloading, but not after one click of a link. We analyze traffic to determine
robot downloads. I suspect, though, that the entire university did not get cut
off in this incident. Usually it is done on a per-IP basis, and unless the
university proxies all of its journal traffic through a single IP, which is not
common, saying the whole university was blocked may be an exaggeration. I
personally wish we had no robot monitor, but then again we would get heavy
spidering of large files.

~~~
dllthomas
Is there reason to block instead of throttling?

~~~
specialp
We do have a CAPTCHA too before the block. Basically to get the block you have
to really work hard at it. We also do not mind limited robot use for cases
like downloading all papers given a search term or author but we do not want
people downloading our entire corpus either. So throttling is not an option.

I think the approach mentioned in the article is definitely heavy-handed. When
it comes down to it, at my place we are just trying to block the wget -r's of
the world.
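
The throttle-then-CAPTCHA-then-block escalation described here is often built on something like a per-IP token bucket. A minimal Python sketch, with all names and parameters invented for illustration:

```python
import time
from collections import defaultdict

# Hypothetical per-IP token bucket: throttle bursts with a 429 or CAPTCHA
# instead of banning a whole institution outright.
class TokenBucket:
    def __init__(self, rate, burst):
        self.rate = rate                    # tokens refilled per second
        self.burst = burst                  # bucket capacity (burst allowance)
        self.tokens = defaultdict(lambda: float(burst))
        self.last = {}

    def allow(self, ip):
        now = time.monotonic()
        elapsed = now - self.last.get(ip, now)
        self.last[ip] = now
        # refill proportionally to idle time, capped at the burst size
        self.tokens[ip] = min(self.burst, self.tokens[ip] + elapsed * self.rate)
        if self.tokens[ip] >= 1.0:
            self.tokens[ip] -= 1.0
            return True
        return False                        # throttled, not blacklisted

bucket = TokenBucket(rate=1.0, burst=5)
results = [bucket.allow("10.0.0.1") for _ in range(10)]
assert results[0] is True     # a small burst gets through
assert results[-1] is False   # sustained hammering gets throttled
```

A blocked request here can be answered with a CAPTCHA or a `Retry-After` header, which matches the gradual escalation described above far better than a one-click ban.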

~~~
zAy0LfpBZLC8mAC
Is there any reason why you don't want people downloading your entire corpus?

------
raverbashing
Doesn't Chrome pre-load links as well?

Not sure it checks for styling before prefetching them.

~~~
nraynaud
No, I think this plan was scrapped as too dangerous (too many websites had
stateful actions behind GET). I think they just stuck to pre-loading DNS.

edit: and it was messing up the webstats for advertising.

~~~
aidenn0
We had an internal wiki where the "delete article" link was a GET. Then
someone wrote a crawler for it and deleted the entire wiki in 15 minutes. It
was changed to a POST after that.
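
The fix described, moving destructive actions off GET, can be sketched with a toy dispatcher. This is a stand-in for a real web framework; the paths and names are made up:

```python
# Sketch: route destructive actions through POST only, so crawlers and
# prefetchers that follow plain GET links cannot mutate state.
articles = {"home": "welcome", "faq": "questions"}

def handle(method, path):
    """Tiny request dispatcher standing in for a real web framework."""
    if path.startswith("/delete/"):
        name = path[len("/delete/"):]
        if method != "POST":
            return 405            # Method Not Allowed: crawlers stop here
        articles.pop(name, None)  # only a deliberate POST deletes
        return 204
    return 200

# A crawler issues GETs for every link it finds...
assert handle("GET", "/delete/home") == 405
assert "home" in articles         # nothing was deleted

# ...while a real user's form submission uses POST.
assert handle("POST", "/delete/home") == 204
assert "home" not in articles
```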

~~~
vijayp
Heh, this reminds of a story many years ago at Google, where we got angry
messages from some guy complaining that Google kept deleting all the photos
from his online album.

We eventually figured out that his online album had an unprotected "delete this
photo" endpoint reachable via GET, and no robots restriction! We had to fix
the crawler to detect things like this...

~~~
userbinator
I bet Google gets tons of these:
[http://thedailywtf.com/Articles/The_Spider_of_Doom.aspx](http://thedailywtf.com/Articles/The_Spider_of_Doom.aspx)

------
owenversteeg
For anyone that can't load the page, here's the site from Google's cache:
[http://webcache.googleusercontent.com/search?q=cache:_EBW_po...](http://webcache.googleusercontent.com/search?q=cache:_EBW_poxSLoJ:blogs.ch.cam.ac.uk/pmr/2014/04/02/acsgate-pandora-opens-the-american-chemical-societys-box-and-her-university-gets-cut-off/+&cd=1&hl=en&ct=clnk&gl=us)

~~~
jrochkind1
Oddly, the google cache version won't load for me either. The google cache
header is there, but the content area is blank, with the chrome status bar
saying "Waiting for blogs.ch.cam.ac.uk".

Looking at the source... there are some weird things going on. I think maybe
the _original_ page loaded its content with Javascript, and the google cached
version is just the JS skeleton, still waiting to load JS from the original
(overloaded) site, which would actually load the content?

Ugh. The trend for JS-dependent sites for simple content breaks the web,
people.

~~~
MertsA
No, you need to click the text-only link in the header or it will still try to
load images and will eventually time out.

------
a3n
It would be interesting to see what conversations might happen if lots of
people from lots of universities clicked on these traps.

------
k2enemy
The warning message returned by the spider-trap says that it banned a
particular IP address. How does this cut off the entire university? Is
everyone behind a NAT?

~~~
TillE
For licensing purposes, they'd need to be able to associate ranges of IP
addresses with a specific institution. So if they want to, it's easy to block
that whole license for one violation.

------
DangerousPie
This is an important topic, but that blog entry was not very well written. If
I hadn't heard about this before, I would have been very confused about what
they actually wanted to say with this convoluted story.

~~~
dang
Can anyone suggest a better url? If so, I'll change it.

------
gcb0
1\. get university with good ties to ACLU and other such movements.

2\. subscribe

3\. click link

4\. sue them for breach of contract and damages. (they didn't deliver the
content you paid for, it damaged your main source of income: providing
knowledge to paying students)

5\. repeat.

------
ChuckMcM
Sigh, did no one notice that the link is in a <span id="hide">? Look at the
style sheet and note that the 'hide' rule sets the link to be the same color as
the background (it makes it invisible to humans), and yet it got clicked on
anyway.
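
A crawler that wanted to avoid tripping such traps could refuse to follow links nested inside hidden containers. A simplified Python sketch using the stdlib parser; it only checks the `id="hide"` convention seen on these pages, not real CSS visibility, and assumes well-formed, non-void tags:

```python
from html.parser import HTMLParser

# Collect only links that are NOT nested inside an element with id="hide".
class VisibleLinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.hidden_depth = 0   # how deep we are inside a hidden subtree
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self.hidden_depth or attrs.get("id") == "hide":
            self.hidden_depth += 1          # entering (or inside) hidden markup
        elif tag == "a" and "href" in attrs:
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if self.hidden_depth:
            self.hidden_depth -= 1          # leaving hidden markup

html = ('<a href="/doi/abs/10.1021/bi300674e">paper</a>'
        '<span id="hide"><a href="/doi/pdf/10.1046/9999-9999.99999">'
        '<!-- Spider trap link --></a></span>')
parser = VisibleLinkExtractor()
parser.feed(html)
# Only the visible link survives; the trap link is skipped.
assert parser.links == ["/doi/abs/10.1021/bi300674e"]
```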

There are bad actors out there, they exploit services, and one of the ways the
services detect them is to create situations that a script would follow but
that a human would not. When they do something bad you've got a couple of
choices: cut them off or lie to them (some of the Bing markov-generated search
pages for robots are pretty fun).

So she sends an email to the address provided, they talk to her, she gets
educated, and they re-enable access. If it happens again the issue gets
escalated. It's the circle of fraud.

~~~
abruzzi
Some people also override site CSS with their own, which could well make a
link that was intended to be hidden come unhidden. Most browsers I've used
have that option.

~~~
zAy0LfpBZLC8mAC
There even are browsers out there that simply don't support CSS.

~~~
dllthomas
Lynx seems a likely candidate, though I'm not positive it has zero support.

~~~
pseut
Konqueror on RHEL seems more likely than Lynx.

------
joshdance
Tack spider traps and booby-trapped documents onto the long list of scientific
publishing problems.

------
fit2rule
This is only interesting for as long as ACS is asleep at the wheel.

Let's wait and find out how long it takes them to respond to the inevitable
interest that 999999.99999 people will have sent their way...

------
userbinator
I sometimes get a similar message from Google (maybe it's due to the search
queries I use...), but they provide a CAPTCHA so you can (reasonably) show
that you're a human.

------
patcon
Some asshole just discovered a whole new reason to wardrive...

------
keithgabryelski
It's odd that at the top of the article the author claims Pandora might be a
scientist or a librarian (but they won't reveal such things), then later claims
she looked at the hidden link because she was curious (because scientists are
curious). Maybe someone should have re-read their text for consistency.

------
obastemur
For the last 5 minutes I've been trying to reach this link; the website is not
reachable any more. How many people are trying to do the same?

~~~
pbhjpbhj
[http://blogs.ch.cam.ac.uk/pmr/](http://blogs.ch.cam.ac.uk/pmr/) worked for me
eventually, also
[http://webcache.googleusercontent.com/search?q=cache:_EBW_po...](http://webcache.googleusercontent.com/search?q=cache:_EBW_poxSLoJ:blogs.ch.cam.ac.uk/pmr/2014/04/02/acsgate-pandora-opens-the-american-chemical-societys-box-and-her-university-gets-cut-off/&hl=en&strip=0).

There are several posts about this issue,
[http://blogs.ch.cam.ac.uk/pmr/2014/04/03/acsgate-the-america...](http://blogs.ch.cam.ac.uk/pmr/2014/04/03/acsgate-the-american-chemical-society-spider-trap-reactions-and-warning/) appears to be the best
I've looked at as it gives details of the links that were followed that
initiated the suspension of service by ACS.

------
nathanvanfleet
I work at a university and just clicked on it.

------
notastartup
Trying to stop spidering or web scraping, or making it criminal, is asinine.
If you don't want it crawled, do not publish it online. Even if you put content
up as a Flash or Java applet, someone will find a way to crawl/scrape it.

This goes against the nature of the internet and information, it is bound to
be free.

~~~
logfromblammo
You cannot simultaneously publish something and stop people from knowing what
it contains. Expecting that you can is absolutely insane--literally insane--as
in believing that P and not-P can simultaneously be true.

This makes me furious. It isn't because the intent is malicious. That only
makes me just a tiny bit angry. I am furious because the malice was
implemented in the stupidest, most useless, laziest manner possible.

It's like keeping the neighborhood kids off your lawn by burying a pressure
plate switch out there for the armed nuclear bomb in your garage. And then not
telling anyone about it. And then inviting all the neighbors over for a
croquet tournament.

