
UK chemist on Elsevier's ban on textmining - czam
http://blogs.ch.cam.ac.uk/pmr/2011/11/25/the-scandal-of-publisher-forbidden-textmining-the-vision-denied/
======
ChristianMarks
An outrage. I know an eminent chemist who will not referee for Elsevier--he
was ahead of his time. The monopoly that the scientific publishers have
consists of controlled access to 60 years of copyrighted work. An embargo of a
few years after publication would be considerably better than what we have,
but the intellectual monopolists want those billions. The cost of publishing
hasn't gone up--the money is going into their executive suites. And most
likely to their executives' sweetie pies.

------
jrockway
I'm not sure how a tool to read some text and display a diagram of the
chemical reaction described falls under the "law" of the quoted passage.
Copyright law certainly allows for this, so all the journals can do is say,
"you don't get to buy our feed anymore if you run your program on our
articles," but surely this is a game of chicken because no company wants to
lose tens of thousands of dollars a year for no reason.

I would implement this as a browser plugin that uploads the content to a
server (like Google Translate), and let the journals deal with each rogue user
individually.

Telling people what software they can use to read text doesn't scale.

~~~
michael_dorfman
 _surely this is a game of chicken because no company wants to lose tens of
thousands of dollars a year for no reason._

What would happen if you actually did try to textmine it? Elsevier is willing
to play this game of chicken; they are convinced that their customers (i.e.,
research universities) cannot do without their product. Hence the OP's report
that twice, they instantly and without warning cut off the entire
university's access to their product because of detected scraping.

 _Telling people what software they can use to read text doesn't scale._

Unfortunately, I think it does. If individual users, even a large number of
them, use a plug-in, they might get away with it-- but this would result in an
incomplete data set. To systematically textmine the corpus (which is the task
at hand) requires some kind of systematic access to the data, and this is
where Elsevier steps in and shuts it down.

~~~
jrockway
Possibility number two is to break into a router closet at MIT to do the
scraping :)

~~~
stfu
What actually happened with that guy? I remember all the mainstream media
stories about what a badass hacker he was, but I never caught up with the
"happy ending".

~~~
dangrossman
<http://en.wikipedia.org/wiki/Aaron_Swartz#JSTOR>

~~~
jrockway
Not really a bad ending. The feds have no case against him, MIT and JSTOR
aren't going to sue him, and JSTOR decided that releasing their archive was a
good idea. Hopefully the FBI will drop their case.

~~~
dangrossman
35 years in prison would be a pretty bad ending, and he's still being
prosecuted for that.

~~~
jrockway
35 years in prison is what you would get in federal court for walking across
the street as the light was changing from yellow to red. It is there to give
the government a better position to offer a plea agreement. It is highly
unlikely that he would be convicted of every count, and even if he was, it is
highly unlikely that he would receive the maximum sentence for each count.

But, 35 years is scary, and if I were in his shoes and the government said
"pay a $100,000 fine", I'd probably agree without much argument. And that's
exactly the point of saying he faces "up to" 35 years.

------
chwahoo
I think prevention of semantic extraction to diagrams would be better pitched
as "another unforeseen negative consequence of agreeing to Elsevier's terms",
rather than as a significant part of the problem. If this story became a major
part of the larger narrative, Elsevier could respond to this particular
concern in a limited way (special access for researchers who apply for it),
rather than dealing with the more central concerns.

~~~
PaulHoule
people don't realize how quickly text mining is coming along these days.

the chemistry papers talked about in this article are an easy domain, but
similar shredding of more general documents will be possible by 2018.

developing all this takes people's time and other resources, for which money
is a proxy. the best way to beat Elsevier is to build something better -- but
that something better is going to take substantial funding.

Scientific papers can be published online at a cost well under $35. Add
bureaucracy to that, however, and the cost can increase a hundredfold.
Conventional funding agencies have a limited interest in long-term programs
(as opposed to projects) so that organizations find it difficult to afford to
maintain digital libraries after they are built. To bring the evil empire
down, somebody needs to figure out how to get that $35.

------
billswift
Here is the follow-up he mentioned: Textmining: My years negotiating with
Elsevier, [http://blogs.ch.cam.ac.uk/pmr/2011/11/27/textmining-my-
years...](http://blogs.ch.cam.ac.uk/pmr/2011/11/27/textmining-my-years-
negotiating-with-elsevier/)

------
alexi_dst
I feel really sad about this as well. I happen to have published a few
articles in Elsevier journals. I wish what I published were available for
everyone's eyes. The point of publishing it was to share my findings with the
world, not only with those who pay.

To make it more available and searchable I actually uploaded everything to
academia.edu. I hope they don't get sued by Wiley and/or Elsevier for the
service they offer. If anyone wants to check out some chemistry you can check
out what I have here <http://unlv.academia.edu/AlexiNedeltchev>

A few side thoughts: I think there are many things that can be improved in
science publishing:

1\. Articles could be more interactive by providing a discussion/commenting
section.

2\. Currently, if you want to see if the article you are checking out is worth
reading you have to refer to the overall rating of the journal (this is known
as the impact factor). I think every article should have a separate rating;
that way you can tell which articles are high impact and which are not. By
the way, the impact factor is calculated from how many articles cite the
article in hand. Does that sound familiar? It's the same concept as webpage
SEO: the more links pointing in, the higher the rank.

3\. The publishing process is very inefficient. It takes months to get
something published since it has to be peer-reviewed. This is an obsolete
approach that begs to be improved. Any ideas?

------
Loic
Repost of my comment on the blog (because it is still waiting in the
moderation queue).

You should take the time to discuss a bit with your librarian. When I did my
PhD in Denmark (DTU), I naively wrote a robot to download the issues of a well
known chemical data journal. After about a week of steady usage, I went to
talk with our librarian. He had seen my usage, and was nice enough not to make
an issue of it, but he told me this: I had downloaded more than the entire
university does in a year… and it was not a lot. It means that at that time,
they paid a bit less than $35 per article.

What is really important to notice is that Elsevier is not selling knowledge
to most of the scientific community, but influence. That is, you are published
and cited, you get ranking, and your university rewards you. This is what we
need to address if we want truly open access. We need a better way to “sell”
influence to university researchers and deans.

As I am building Cheméo <http://chemeo.com>, a chemical data search engine, I
suffer too. It is maybe time to unite and propose a legal, efficient and
rewarding way for researchers to publish their papers. We can do that on the
side and let our influence grow.

Additional notes for HN readers, as yeah, we are a bit more on the programming
side. What we basically need is a parallel DOI system that is easy to use,
able to load all the open repositories, and able to accept "direct"
submissions.

We are not going to solve the problem in a year. This is an influence issue;
it will take years to really address it, be it by our own work or by "law".

~~~
dalke
How come your search system doesn't appear to know chemistry? That is, I
searched for "CCO" and found ethyl alcohol, but I searched for "OCC" and found
nothing. Are you only doing a text search on the SMILES rather than a
canonicalization first?

~~~
Loic
Good point, this is on the way. The problem is more to do it right: for each
"word" you want to detect whether it is a plain word, a chemical formula or a
SMILES string. Then, if it is a SMILES, you need to canonicalize it and
search.

Nothing very complicated, but it needs to be well done to be of any use.

Thanks for the feedback!

~~~
dalke
Plus, to detect if it's a systematic or semi-systematic name, and extract the
structure from that. I know of three tools which do that, and only one is
free.

If you want a SMILES detector, you can use my opensmiles-ragel grammar to
detect if a word is syntactically correct. (Not grammatically correct; it will
allow "c1C" unless you write code to require balanced parentheses and matching
ring counts.)
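A rough sketch of those two extra checks in plain Python, rather than the ragel grammar itself (the function name is mine; bracket atoms are skipped because digits inside them denote isotopes, hydrogen counts or charges, not ring closures):

```python
def rings_and_parens_ok(smiles):
    """Reject syntactically valid but ungrammatical SMILES such as "c1C":
    parentheses must balance, and every ring-closure number must appear an
    even number of times (once to open the ring bond, once to close it)."""
    depth = 0
    ring_counts = {}
    i = 0
    while i < len(smiles):
        ch = smiles[i]
        if ch == '[':
            # Skip the bracket atom wholesale; its digits are not ring bonds.
            i = smiles.index(']', i)
        elif ch == '(':
            depth += 1
        elif ch == ')':
            depth -= 1
            if depth < 0:        # closing a parenthesis that was never opened
                return False
        elif ch == '%':          # two-digit ring closure, written %NN
            label = smiles[i + 1:i + 3]
            ring_counts[label] = ring_counts.get(label, 0) + 1
            i += 2
        elif ch.isdigit():
            ring_counts[ch] = ring_counts.get(ch, 0) + 1
        i += 1
    return depth == 0 and all(n % 2 == 0 for n in ring_counts.values())
```

So "c1ccccc1" (benzene) passes, while "c1C" is rejected because ring bond 1 is opened but never closed.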

This would be much faster than passing every token to one of the
cheminformatics toolkits to do the first-level detection.

If you're doing similarity searches, you might be interested in my chemfp
project.

------
estevez
This is precisely where the Copyright Office (and its UK equivalent) can step
in and make explicit that textmining in this fashion is unambiguously fair
use.

------
aswanson
Google "Swanson linking". This ban is probably impeding discovery along that
vector as well.

------
aba_sababa
It's not the semantic interpretation of text that they're banning, but the
scraping of text, which deals with copying what they'd prefer to sell you.
Still an abhorrence, but let's get our facts straight.

~~~
politician
Actually, it sounds like they are trying to sell their interpretation of a
recipe because they recently acquired a company which extracts recipes.
They've banned their customers from running the same tool which they probably
use internally in order to protect their slice of the market.

------
fleitz
Easy solution: create a human intelligence task distribution system, have
every university that has access participate, and assign lab students to
perform the HIT of downloading the documents. Voilà, problem solved.

After it's all been mined, stop submitting to Elsevier.

------
guard-of-terra
Why can they do that and be listened to? Why not do the textmining anyway and
say "so sue me"?

------
7952
So I can't use Ctrl-F?

------
JulianMorrison
End copyright.

------
arguesalot
Until academia realizes that computers can read too, it's important for
companies that DO have access to these papers (like Google Scholar[1]) to
create 3rd party APIs so that we can at least have better search tools.

1\. [http://code.google.com/p/google-ajax-
apis/issues/detail?id=1...](http://code.google.com/p/google-ajax-
apis/issues/detail?id=109)

------
arguesalot
Just throwing it into the mix: what would it take to convince Google to stop
indexing the content of journal articles from closed-access journals? Without
search, the articles are siloed, and both journals and authors would get a
taste of the importance of opening access.

~~~
flashingleds
Unlikely to happen, but in any case also unlikely to be all that effective.
Google Scholar is getting good, but when looking for papers I will usually use
Scopus/ScienceDirect (Elsevier) or ISI (Thomson Reuters). Sowing the seeds of
my own destruction and so forth.

