
Peter Murray-Rust (author of this blog post) is a really great man. He's been a tireless advocate for dismantling privilege and setting knowledge free for several decades. I'm proud to say he's becoming a sort of mentor to me. Last week I spent a couple of days with his research group and saw this software in action - it's really impressive.

They can take an ancient paper with very low quality diagrams of complex chemical structures, parse the image into an open markup language and reconstruct the chemical formula and the correct image. Chemical symbols are just one of many plugins for their core software which interprets unstructured, information rich data like raster diagrams. They also have plugins for phylogenetic trees, plots, species names, gene names and reagents. You can develop plugins easily for whatever you want, and they're recruiting open source contributors (see https://solvers.io/projects/QADhJNcCkcKXfiCQ6, https://solvers.io/projects/4K3cvLEoHQqhhzBan).

As a side effect of how their software works, it can detect tiny suggestive imperfections in images that reveal scientific fraud. I was shown a demo where a trace from a mass spec (like this http://en.wikipedia.org/wiki/File:ObwiedniaPeptydu.gif) was analysed. As well as reading the data from the plot, it revealed a peak that had been covered up with a square - the author had deliberately obscured a peak in their data that was inconvenient. Scientific fraud. It's terrifying that they find this in most chemistry papers they analyse.

Peter's group can analyse thousands or hundreds of thousands of papers an hour, automatically detecting errors and fraud and simultaneously making the data, which are facts and therefore not copyrightable, free. This is one of the best things that has happened to science in many years, except that publishers deliberately prevent it. Their work also made me realise it would be possible to continue Aaron Swartz' work on a much bigger scale (http://blahah.net/2014/02/11/knowledge-sets-us-free/).

Academic publishers who are suppressing this are literally the enemies of humanity.

> Academic publishers are literally the enemies of humanity.

I would not be so extreme: they used to be a necessary evil before the Internet became mainstream. They are more of a millstone from the past that research still drags with it because changing an institution is hard.

>They are more of a millstone from the past that research still drags with it because changing an institution is hard.

It's not that changing is hard, since it would be easy if that's what everyone wanted. It's that you have many competing forces (perhaps that's what you mean by "hard"). The most important is scientists who made their careers in the current system--the people who publish yearly in Nature, for example. They have tremendous political sway and it is not in their interest to change. They will even admit that it is not an ideal system, but no one likes giving up being King.

Edit: I should point out that it is more complicated. Science does need a way to judge which scientists are doing valuable work and which aren't, and Impact Factors are currently one of the main ways this is done. We don't yet have a great alternative.

I don't think anyone but the traditional publishers wants to keep traditional publishers.

Every scientist I've ever met wants as many people as possible to read (and cite) their work. But the main issue is that the main metric for scientists' performance is number of publications in prestigious journals, and the oldest journals tend to be the most prestigious ones. If a scientist wants to get grant funding or promotions, they have to put up with the copyright terms of the traditional publishers.

Any real change to the system has to start from the government agencies and councils handing out grant funding to research groups. As long as they prefer to fund people with Nature papers over SomeOpenAccessJournal papers, only a few scientists are willing to sacrifice their career for the principle of open access.

Altmetrics is a growing field - I think in the next couple of years we will have an ecosystem of tools (like impactstory.org, altmetric.com) much better and more rounded than impact factors. I also think impact factors are poisonous in their own right.

What proponents of "open access science" fail to realize is that science is a business like any other, and a very big one involving lots of money. Like in any business, some form of marketing and reputation management is key. Academic publishers provide essential marketing services: filtering, sorting, and branding. You self publish, and most people regard you as a kook. You publish in Nature, and suddenly you're someone worth reading. Branding, credentialing, signaling, all these things are just as important if not more so in science than in any other field, and that's the service academic publishers provide.

At the end of the day, nobody is holding a gun to the heads of professors forcing them to publish in these journals. These are freely-negotiated agreements between researchers and publishers. The reason researchers continue to support the existing system is that it serves their purposes and furthers their careers. Any alternative system will have to provide similar benefits.

> is a business like any other

If it's a business, then it's a business full of tyrannical monopolies. Science publishers are essentially leeching public money, because government funding goes only to people who publish through those publishers. That is a perpetual loop that is impossible to break, since scientists are incentivized (with public money) to publish through established journals. That is not a 'clean' business like any other.

OTOH, open access doesn't mean that one has to do away with all the established journals, just that they have to make their content open to everyone. There are other ways to cover their operating costs, like charging for publishing (a better model, in which scientists have an incentive to choose the journal that offers the best value/money).

None of this explains why they should get the copyright.

They negotiate with researchers for it.

Just like I negotiate with my hospital for medical care payment.

Life lesson for US HNers who may never have been in this situation: you can negotiate with healthcare providers on price. It is actually fairly routine. They won't think less of you.

70 cents on the dollar delivered immediately in cash beats the heck out of 100 cents on the dollar transferred to a debt-collection agency, worked for 6 months, and negotiated down to 60 cents of which the agency keeps 15.
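To make the arithmetic above explicit, here is a minimal sketch. It assumes "keeps 15" means 15 cents of each billed dollar (not 15% of what is collected), and the function name is purely illustrative:

```python
def provider_net_cents(collected_cents: int, agency_fee_cents: int = 0) -> int:
    """Cents per billed dollar that actually reach the provider."""
    return collected_cents - agency_fee_cents


# Patient pays 70 cents on the dollar immediately, in cash.
cash_now = provider_net_cents(70)

# Debt goes to collections: settled at 60 cents, agency keeps 15 (assumed cents).
via_collections = provider_net_cents(60, agency_fee_cents=15)

# How much better off the provider is taking the cash offer.
advantage = cash_now - via_collections
print(cash_now, via_collections, advantage)
```

Under that reading the provider nets 45 cents per dollar through collections, after a six-month delay, against 70 cents up front.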

>You self publish, and most people regard you as a kook. You publish in Nature, and suddenly you're someone worth reading.

Nature (and others) are (in theory) peer-reviewed journals, which lends a sense of credibility that self-publishing doesn't have.

Edited to be more specific - it's only the publishers who are systematically restricting knowledge that are at fault. But no matter where they've come from, what they are doing now is holding humanity back on a massive scale.

It's a narrative that plays very well on HN, that <insert industry here> are not only obsolete but evil and what's more stupid and we - the enlightened software developers - will free humanity from their shackles. But the world is a good deal more complicated than that. And not least because what most software people here do wouldn't really be missed if it went away. Ask Altavista and Myspace about that.

Of course the world is more complicated than that - but just saying it's more complicated is not a constructive contribution.

We can talk about problems we perceive and try to solve them, or we can sit around saying things are complicated. Let's do the first one.

That ground has been covered many times. Hotels it turns out do things that AirBnB doesn't, for good reasons. Taxis it turns out do things that Uber doesn't. Yadda yadda. My critique is intended of the theme, not any specific instance of it.

You probably consider yourself neither stupid nor evil - try to remember this when it's your turn to be "disrupted"...

Well, then I agree in the abstract. I'm not suggesting we disrupt publishing, or try to save it. We (loosely the open science movement) are trying to fix our own field. Publishers serve a purpose, but some of what they do is harming us. We're encouraging the ones that can be encouraged to change. We'll force the ones that won't be encouraged.

You're right - I don't consider myself stupid or evil. But if I ever do become those things and it hurts others, whether I realise it or not, I hope someone forces me to stop.

> Ask Altavista and Myspace about that.

One particular search engine or one particular social network wouldn't be missed if gone, but I bet search engines and social networks would be missed if they all went away completely (and it would also be pretty bad if the Internet shut down entirely).

And like traditional publishers, there is still a role they can play: editing and synchronisation of anonymous review remain an important element of paper publication.

They theoretically could play this role, but others are much better at it. These companies are structured to exploit scientists and the public, not to enable them. Innovation is stifled both within and without. We're much better off turning to fresh approaches to peer review, for example any of the things on this list:


This is not done by the publishers. They only provide an online system to do so. The editors and reviewers are other scientists.

> This is not done by the publishers.

The work itself obviously is not[0], but you need somebody to synchronise that stuff, especially to ensure reviews remain anonymous (to paper authors). This must be handled by a third party.

[0] Not entirely true for editing: publishers generally employ a number of editors, although these editors usually have a solid science background for obvious reasons.

Aw come on now, they do perform a service that's of nonzero value - they (ostensibly) sift the wheat from the chaff. I don't have the expertise to know what's valid research and what's hokum, but I know if I read it in Nature, I can at least take it seriously.

Admittedly I believe we could develop a much better system for doing this than academic publishing, but I don't honestly know what such a system would look like.

Uh, you do know that the editing and reviewing for these journals are normally done by mostly government-supported academics who don't charge for their time, don't you? The academic journal business has devolved into taxpayer dollar farming with the usual campaign $$$ payoffs to establish and maintain monopolies.

The fraud detection aspect is truly intriguing. The motivation for the fraud could well be seeing other people's fraudulent papers and comparing yourself to their impossible data, and making the change in a desperate move for acceptance.

Academia as a whole has many diseases, but the emergence of the scientific academic world as a unified hive mind, which thanks to antics like this gets increasingly out of touch with reality, is seriously detrimental to progress.

>Peter's group can analyse thousands or hundreds of thousands of papers an hour, automatically detecting errors and fraud and simultaneously making the data, which are facts and therefore not copyrightable, free. //

Facts aren't copyrightable but their presentation is. If they're copying the presentation of those facts into a computer for commercial purposes (free release of the content of those papers is certainly commercial) then in the UK they'd be committing copyright infringement. There is also a possibility of infringing on EU database rights.

The OP link says:

>"Note that chemical structure diagrams are NOT creative works."

Whilst a plot or diagram may appear at first to lack "art" the presentation is formatted and certainly a "sweat of the brow" analysis supports it being copyrighted.

You can't scan someone else's molecular diagrams but you could look at them and enter the data yourself. Just like a chart, you can't duplicate the chart except by looking at it and copying the information - the presentation of which has been filtered away. Duplication in a computer produces an exact copy and AFAICT under UK legislation that would be infringing [silly as it is as the end result is the same it just takes more work].

Relying on only duplicating the image transiently into non-persistent memory might satisfy the requirement not to make a copy. That would be an interesting test case. How the law has been construed to allow the use of search engines would probably help here.

>take an ancient paper //

I'm curious what license you'd acquire the papers under that allows their duplication in this manner. Even ancient papers are often controlled by the library they're stored in or by those that performed their digitisation.

[FWIW I strongly disapprove of the strength and duration of currently granted copyright]

I agree with all your analysis. In this case the presentation is discarded and, as I see it, only the facts are retained.

For example, in parsing chemical structure diagrams, what is recorded are things like the number of atoms in a section of a molecule, their angles and what kind of bonds exist to neighbouring atoms. These data are then analysed to generate the formula and reconstruct a correct diagram.
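As a rough illustration of the "facts only" step described above, here is a minimal Python sketch. It is not the group's actual software: the `Atom`, `Bond` and `hill_formula` names are hypothetical, and it assumes every atom (including hydrogens) has already been extracted explicitly from the diagram. The formula is written in the Hill convention (carbon first, then hydrogen, then the rest alphabetically; fully alphabetical when there is no carbon).

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Atom:
    element: str  # element symbol, e.g. "C"
    x: float      # position recovered from the diagram
    y: float


@dataclass
class Bond:
    a: int          # index of the first atom
    b: int          # index of the second atom
    order: int = 1  # 1 = single, 2 = double, 3 = triple


def hill_formula(atoms):
    """Build a molecular formula from the parsed atoms, Hill order."""
    counts = Counter(atom.element for atom in atoms)
    if "C" in counts:
        rest = sorted(e for e in counts if e not in ("C", "H"))
        order = ["C"] + (["H"] if "H" in counts else []) + rest
    else:
        order = sorted(counts)
    return "".join(e + (str(counts[e]) if counts[e] > 1 else "") for e in order)


# Ethanol, with made-up coordinates: only elements and connectivity are kept.
ethanol_atoms = (
    [Atom("C", 0.0, 0.0), Atom("C", 1.5, 0.0), Atom("O", 2.2, 1.2)]
    + [Atom("H", float(i), -1.0) for i in range(6)]
)
ethanol_bonds = [Bond(0, 1), Bond(1, 2)]  # heavy-atom bonds only, for brevity

ethanol = hill_formula(ethanol_atoms)
water = hill_formula([Atom("H", 0, 0), Atom("H", 1, 0), Atom("O", 0.5, 0.5)])
print(ethanol, water)
```

Nothing of the original drawing (fonts, line styles, layout) survives in this representation, which is the point being made about presentation versus facts.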

There are no licenses to my knowledge on older papers that allow publishing the analysis (but within the University we have agreements with JSTOR, for example). Looking at public domain stuff is OK. This is part of the issue - that knowledge should be a public good.

If you're a lawyer I'm sure Peter would appreciate hearing your opinion.

I would think that the conversion from a rasterized scan to an "open markup language" would be sufficient to count as a new presentation, no?

>conversion from a rasterized scan to an "open markup language" //

[Skip to the end!]


The problem is that a rasterised scan is a new - potentially unauthorised - copy. UK law tends to be more restrictive as we don't have the same sense of "Fair Use" as 17 USC.

Art 5.1 of the EU Copyright Directive (2001/29/EC; Section 28A of the UK CDP Act) at Section 1(b) allows for transient copies to be made when the copying is part of an otherwise allowed act. But the stipulation is that the copy can't have "economic significance".

Here the rasterisation would then appear to fail, even if it can be considered a transient part of the transfer of the information from copyrighted diagram to free-libre ML.

This to me - as a non-expert [though I consider myself pretty well read on copyright] - means that the conversion needs to be made without making an intermediate copy. A manual process would bypass the problem of making a copy for a computer program to analyse at the expense of lots of human input.

This is where things get silly as the end result is the same - the extraction of information from a catalogue of molecular diagrams - the process is just made more expensive. I rather hope my analysis is wrong actually and that a court would rule that scanning such works would be allowable in order to extract the informational content; would love to have more input here.


... actually further looking at S.28A(b) makes me think I am wrong; that this should be allowed. I'm convinced that the copies made aren't independently commercially significant and that as the process of extracting the information from the diagrams is an allowed use then the "transient copy" legislation makes this allowable.

IA[of course]NAL.

Crucially, here the rasterised scan is made by the publishers, and whoever runs the analysis software is allowed to access the digital image.

I'm not sure that is crucial - I'll bet there are terms associated with the allowance of access to the already rasterised images along the lines of "solely for individual reading".

However, I'll leave it there, as it's too complex an issue to address in generalities rather than in terms of the specific nature of the inputs, processing and analysis, intended uses, commercial aspects and such.

I pray every blessing on your knowledge sharing endeavours.
