An outrage. I know an eminent chemist who will not referee for Elsevier--he was ahead of his time. The monopoly that the scientific publishers have consists of controlled access to 60 years of copyrighted work. An embargo of a few years after publication would be considerably better than what we have, but the intellectual monopolists want those billions. The cost of publishing hasn't gone up--the money is going into their executive suites. And most likely to their executives' sweetie pies.
I'm not sure how a tool to read some text and display a diagram of the chemical reaction described falls under the "law" of the quoted passage. Copyright law certainly allows for this, so all the journals can do is say, "you don't get to buy our feed anymore if you run your program on our articles," but surely this is a game of chicken because no company wants to lose tens of thousands of dollars a year for no reason.
I would implement this as a browser plugin that uploads the content to a server (like Google Translate), and let the journals deal with each rogue user individually.
Telling people what software they can use to read text doesn't scale.
What would happen if you actually did try to textmine it?
surely this is a game of chicken because no company wants to lose tens of thousands of dollars a year for no reason.
Elsevier is willing to play this game of chicken; they are convinced that their customers (i.e., research universities) cannot do without their product. Hence the OP reporting that, twice, they instantly and without warning cut off the entire university's access to their product because of detected scraping.
Telling people what software they can use to read text doesn't scale.
Unfortunately, I think it does. If individual users, even a large number of them, use a plug-in, they might get away with it -- but that would result in an incomplete data set. Systematically textmining the corpus (which is the task at hand) requires some kind of systematic access to the data, and that is where Elsevier steps in and shuts it down.
What actually happened with that guy? I remember all the mainstream media stories about what a badass hacker he was, but somehow I never caught up with the "happy ending".
Not really a bad ending. The feds have no case against him, MIT and JSTOR aren't going to sue him, and JSTOR decided that releasing their archive was a good idea. Hopefully the FBI will drop their case.
35 years in prison is what you would get in federal court for walking across the street as the light was changing from yellow to red. It is there to give the government a better position to offer a plea agreement. It is highly unlikely that he would be convicted of every count, and even if he was, it is highly unlikely that he would receive the maximum sentence for each count.
But, 35 years is scary, and if I were in his shoes and the government said "pay a $100,000 fine", I'd probably agree without much argument. And that's exactly the point of saying he faces "up to" 35 years.
First: it's very difficult to grok copyright issues from a one-sided, non-legal presentation. Settled case law means going to trial and often going through multiple appeals.
Second: as it stands, this looks more to be a contractual than copyright matter (Elsevier is contractually banning use of automated text processing by subscribers), though it might possibly attempt to put teeth into this by asserting copyright. The remedy sought by Elsevier would be not to allege infringement, but to cancel future subscriptions. It might be nice for, say, a large collection of institutions to call Elsevier's bluff.
Third: Copyright law as it exists (and particularly in the US where I understand it reasonably well) is very mechanical: it governs the making of copies of an expressive work. Copyright does NOT govern facts, it does not apply to works which are not expressive, it does not apply to works which are functional in nature.
The real test here would be to put this (and possibly other) contract claims to test in a court. Unfortunately, contracts are governed (in the US) under state, not federal law, and while there's some uniformity of language, it would probably take several such cases (and appeals to at least the Federal Circuit) to establish reasonable case law.
Otherwise, what's significant about this to me is that, once again, it's a case far less about the availability and copying of information (journal articles are routinely copied), than it is about power and control within an information market. This is an area in which conventional economics is far too often lacking (though it's also an area in which much interesting work is starting to happen).
I think prevention of semantic extraction to diagrams would be better pitched as "another unforeseen negative consequence of agreeing to Elsevier's terms", rather than as a significant part of the problem. If this story became a major part of the larger narrative, Elsevier could respond to this particular concern in a limited way (special access for researchers who apply for it), rather than dealing with the more central concerns.
people don't realize how quickly text mining is coming along these days.
the chemistry papers talked about in this article are an easy domain, but similar shredding of more general documents will be possible by 2018.
developing all this takes people's time and other resources, for which money is a proxy. the best way to beat Elsevier is to build something better -- but that something better is going to take substantial funding.
Scientific papers can be published online at a cost well under $35. Add bureaucracy to that, however, and the cost can increase to 100 times. Conventional funding agencies have a limited interest in long-term programs (as opposed to projects) so that organizations find it difficult to afford to maintain digital libraries after they are built. To bring the evil empire down, somebody needs to figure out how to get that $35.
I feel really sad about this as well. I actually happen to have published a few articles in Elsevier journals. I wish what I published were available to everyone's eyes. The point of publishing it was to share my findings with the world, not only with those who pay.
To make them more available and searchable, I actually uploaded everything to academia.edu. I hope they don't get sued by Wiley and/or Elsevier for the service they offer. If anyone wants to check out some chemistry, you can see what I have here: http://unlv.academia.edu/AlexiNedeltchev
A few side thoughts: I think there are many things that can be improved in scientific publishing:
1. Articles could be more interactive by providing a discussion/commenting section.
2. Currently, if you want to see whether the article you are checking out is worth reading, you have to refer to the overall rating of the journal (this is known as the impact factor). I think every article should have its own separate rating. That way you can tell which are high-impact articles and which are not. Btw, the impact factor is calculated based on how many articles cite the journal's articles. Does that sound familiar? It's the same concept as webpage SEO: the more links pointing in, the higher the rank.
3. The publishing process is very inefficient. It takes months to get something published since it has to be peer-reviewed. This is an obsolete approach that begs to be improved. Any ideas?
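The citation-ranking idea in point 2 really is the same machinery as web ranking. As a rough sketch (not any publisher's actual method), here is a PageRank-style power iteration over a tiny hypothetical citation graph, where an article's score grows with the scores of the articles citing it:

```python
# Toy citation-based article ranking via PageRank-style power iteration.
# Illustrates the SEO analogy: more (and better) incoming citations -> higher rank.
def rank_articles(citations, damping=0.85, iters=50):
    """citations: dict mapping article id -> list of article ids it cites."""
    articles = list(citations)
    n = len(articles)
    rank = {a: 1.0 / n for a in articles}
    for _ in range(iters):
        new = {a: (1 - damping) / n for a in articles}
        for a, cited in citations.items():
            if cited:
                # an article passes its rank on to the articles it cites
                share = damping * rank[a] / len(cited)
                for c in cited:
                    new[c] += share
            else:
                # article citing nothing: spread its rank evenly
                for c in articles:
                    new[c] += damping * rank[a] / n
        rank = new
    return rank

# Hypothetical graph: articles A and B both cite C.
ranks = rank_articles({"A": ["C"], "B": ["C"], "C": []})
# C, being cited twice, ends up ranked highest.
```

The damping factor and iteration count here are the conventional PageRank defaults; a real article-level metric would also need to handle self-citation and citation cartels.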
Repost of my comment on the blog (because it is stuck in the moderation queue, not yet approved).
You should take the time to talk a bit with your librarian. When I did my PhD in Denmark (DTU), I naively wrote a robot to download the issues of a well-known chemical data journal. After about a week of moderate usage, I went to talk with our librarian. He had seen my usage and was nice enough not to make an issue of it, but he told me this: I had downloaded more than the entire university does in a year… and it was not a lot. It means that at that time, they paid a bit less than the $35-per-article price.
What is really important to notice is that, for most scientific communities, Elsevier is not selling knowledge but influence. That is: you get published, you get cited, you get ranking, and your university rewards you. This is what we need to address if we want truly open access. We need a better way to “sell” influence to university researchers and deans.
As I am building Cheméo http://chemeo.com a chemical data search engine, I suffer too. It is maybe time to unite and propose a legal, efficient, and rewarding way for researchers to publish their papers. We can do that on the side and let our influence grow.
Additional notes for HN readers as yeah, we are a bit more on the programming side. What we basically need is a parallel DOI system easy to use, able to load all the open repositories and able to accept "direct" submissions.
We are not going to solve the problem in a year, this is an influence issue, it will take time, years, to really address it, be it by our own work or by "law".
How come your search system doesn't appear to know chemistry? That is, I searched for "CCO" and found ethyl alcohol, but I searched for "OCC" and found nothing. Are you only doing a text search on the SMILES rather than a canonicalization first?
Good point, this is on the way. The problem is more to do it right: for each "word" you want to detect whether it is a plain word, a chemical formula, or a SMILES string. Then, if it is a SMILES, you need to canonicalize it and search.
Nothing very complicated, but it needs to be well done to be of any use.
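To make the canonicalization point concrete: "CCO" and "OCC" both spell ethanol, just traversed from opposite ends. A real toolkit (RDKit's CanonSmiles, Open Babel, etc.) handles the general case; the toy below only canonicalizes linear, unbranched SMILES by picking the lexicographically smaller traversal direction, which is enough to show why "OCC" should hit the same index entry as "CCO":

```python
import re

def canonical_linear(smiles):
    """Toy canonicalizer for linear, unbranched SMILES strings only.

    A straight chain read backwards is the same molecule, so we pick the
    lexicographically smaller of the two orientations as the canonical form.
    Real canonicalization (rings, branches, aromaticity) needs a toolkit.
    """
    # Tokenize atoms: two-letter organic-subset atoms before one-letter ones.
    atoms = re.findall(r'Cl|Br|[BCNOSPFI]', smiles)
    forward = ''.join(atoms)
    backward = ''.join(reversed(atoms))
    return min(forward, backward)

print(canonical_linear("CCO"))  # -> CCO
print(canonical_linear("OCC"))  # -> CCO, so both queries find ethanol
```

Indexing the canonical form (and canonicalizing each query the same way) is what makes the two spellings collide in the search engine.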
Plus, you need to detect whether it's a systematic or semi-systematic name, and extract the structure from that. I know of three tools which do that, and only one is free.
If you want a SMILES detector, you can use my opensmiles-ragel grammar to detect if a word is syntactically correct. (Not grammatically correct; it will allow "c1C" unless you write code to require balanced parentheses and matching ring counts.)
This would be much faster than passing it to any of the cheminformatics toolkits to do the first level detection.
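The supplementary check mentioned above (balanced parentheses and matched ring-closure counts) is cheap to bolt on after the syntax pass. A minimal sketch, ignoring the %nn two-digit ring-closure syntax and bond symbols:

```python
def rings_and_parens_ok(smiles):
    """Reject syntactically-valid SMILES with unbalanced parentheses
    or unmatched ring-closure digits, e.g. "c1C".

    Simplification: only handles single-digit ring closures; the %nn
    form and digits after bond symbols are not treated specially.
    """
    depth = 0
    open_rings = set()
    for ch in smiles:
        if ch == '(':
            depth += 1
        elif ch == ')':
            depth -= 1
            if depth < 0:          # close before open
                return False
        elif ch.isdigit():
            # first occurrence of a digit opens a ring bond, second closes it
            if ch in open_rings:
                open_rings.remove(ch)
            else:
                open_rings.add(ch)
    return depth == 0 and not open_rings

print(rings_and_parens_ok("c1ccccc1"))  # benzene -> True
print(rings_and_parens_ok("c1C"))       # dangling ring closure -> False
print(rings_and_parens_ok("CC(C"))      # unbalanced parenthesis -> False
```

Run after the grammar check, this filters out strings like "c1C" before they ever reach a full cheminformatics toolkit.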
If you're doing similarity searches, you might be interested in my chemfp project.
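The core comparison in fingerprint-based similarity search is Tanimoto similarity on bit vectors, which chemfp computes at scale. A bare-bones version (fingerprints as plain Python ints standing in for hashed structural fingerprints) looks like:

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity of two bit-vector fingerprints:
    bits set in both, divided by bits set in either."""
    common = bin(fp_a & fp_b).count('1')
    union = bin(fp_a | fp_b).count('1')
    return common / union if union else 0.0

print(tanimoto(0b1101, 0b1001))  # 2 shared bits / 3 total set bits -> 0.666...
```

A dedicated library earns its keep by doing this popcount-heavy loop over millions of fingerprints far faster than naive Python.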
This is precisely where the Copyright Office (and its UK equivalent) can step in and make explicit that textmining in this fashion is unambiguously fair use.
It's not the semantic interpretation of text that they're banning, but the scraping of text, which deals with copying what they'd prefer to sell you. Still an abhorrence, but let's get our facts straight.
Actually, it sounds like they are trying to sell their interpretation of a recipe because they recently acquired a company which extracts recipes. They've banned their customers from running the same tool which they probably use internally in order to protect their slice of the market.
Easy solution: create a human intelligence task distribution system, have every university that has access participate, and assign lab students the HIT of downloading the documents. Voilà, problem solved.
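The coordination piece of that scheme is just a shared work queue of article IDs that each participating account drains in small batches. A minimal sketch (all names hypothetical):

```python
from queue import Queue

def build_queue(article_ids):
    """Load the shared work queue with article identifiers to fetch."""
    q = Queue()
    for aid in article_ids:
        q.put(aid)
    return q

def claim_tasks(q, worker, batch=2):
    """A worker (e.g. one lab student's account) claims its next small
    batch, keeping any single account's download volume unremarkable."""
    claimed = []
    while not q.empty() and len(claimed) < batch:
        claimed.append((worker, q.get()))
    return claimed

q = build_queue(["doi:10.1000/1", "doi:10.1000/2", "doi:10.1000/3"])
first = claim_tasks(q, "uni-A")   # uni-A takes the first two
second = claim_tasks(q, "uni-B")  # uni-B takes what remains
```

In practice the queue would live on a server and track completion and retries, but the load-spreading idea is the whole trick.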
After it's all been mined, stop submitting to Elsevier.
Until academia realizes that computers can read too, it's important for companies that DO have access to these papers (like Google Scholar[1]) to create third-party APIs so that we can at least have better search tools.
Just throwing it into the mix, what would it take to convince Google to stop indexing the content of journal articles from closed-access journals? Surely without search, the articles are siloed; both journals and authors get a taste of the importance of opening access.
Unlikely to happen, but in any case also unlikely to be all that effective. Google Scholar is getting good, but when looking for papers I will usually use Scopus/ScienceDirect (Elsevier) or ISI (Thomson Reuters). Sowing the seeds of my own destruction and so forth.