Machines are better referees than humans but we’ll be sued if we use them (cam.ac.uk)
300 points by inglesp on Feb 19, 2014 | 61 comments

Peter Murray Rust (author of this blog post) is a really great man. He's been a tireless advocate for dismantling privilege and setting knowledge free for several decades. I'm proud to say he's becoming a sort of mentor to me. Last week I spent a couple of days with his research group and saw this software in action - it's really impressive.

They can take an ancient paper with very low quality diagrams of complex chemical structures, parse the image into an open markup language and reconstruct the chemical formula and the correct image. Chemical symbols are just one of many plugins for their core software which interprets unstructured, information rich data like raster diagrams. They also have plugins for phylogenetic trees, plots, species names, gene names and reagents. You can develop plugins easily for whatever you want, and they're recruiting open source contributors (see https://solvers.io/projects/QADhJNcCkcKXfiCQ6, https://solvers.io/projects/4K3cvLEoHQqhhzBan).

As a side effect of how their software works, it can detect tiny suggestive imperfections in images that reveal scientific fraud. I was shown a demo where a trace from a mass spec (like this http://en.wikipedia.org/wiki/File:ObwiedniaPeptydu.gif) was analysed. As well as reading the data from the plot, it revealed a peak that had been covered up with a square - the author had deliberately obscured a peak in their data that was inconvenient. Scientific fraud. It's terrifying that they find this in most chemistry papers they analyse.

Peter's group can analyse thousands or hundreds of thousands of papers an hour, automatically detecting errors and fraud and simultaneously making the data, which are facts and therefore not copyrightable, free. This is one of the best things that has happened to science in many years, except that publishers deliberately prevent it. Their work also made me realise it would be possible to continue Aaron Swartz' work on a much bigger scale (http://blahah.net/2014/02/11/knowledge-sets-us-free/).

Academic publishers who are suppressing this are literally the enemies of humanity.

> Academic publishers are literally the enemies of humanity.

I would not be so extreme: they used to be a necessary evil before the Internet became mainstream. They are more of a millstone from the past that research still drags with it because changing an institution is hard.

>They are more of a millstone from the past that research still drags with it because changing an institution is hard.

It's not that changing is hard, since it would be easy if that's what everyone wanted. It's that you have many competing forces (perhaps that's what you mean by "hard"). The most important is scientists who made their careers in the current system--the people who publish yearly in Nature, for example. They have tremendous political sway and it is not in their interest to change. They will even admit that it is not an ideal system, but no one likes giving up being King.

Edit: I should point out that it is more complicated. Science does need a way to judge which scientists are doing valuable work and which aren't, and currently Impact Factors are one of the main ways this is done. We don't yet have a great alternative.

I don't think anyone but the traditional publishers wants to keep traditional publishers.

Every scientist I've ever met wants as many people as possible to read (and cite) their work. But the main issue is that the main metric for scientists' performance is number of publications in prestigious journals, and the oldest journals tend to be the most prestigious ones. If a scientist wants to get grant funding or promotions, they have to put up with the copyright terms of the traditional publishers.

Any real change to the system has to start from the government agencies and councils handing out grant funding to research groups. As long as they prefer to fund people with Nature papers over SomeOpenAccessJournal papers, only a few scientists are willing to sacrifice their career for the principle of open access.

Altmetrics is a growing field - I think in the next couple of years we will have an ecosystem of tools (like impactstory.org, altmetric.com) much better and more rounded than impact factors. I also think impact factors are poisonous in their own right.

What proponents of "open access science" fail to realize is that science is a business like any other, and a very big one involving lots of money. Like in any business, some form of marketing and reputation management is key. Academic publishers provide essential marketing services: filtering, sorting, and branding. You self publish, and most people regard you as a kook. You publish in Nature, and suddenly you're someone worth reading. Branding, credentialing, signaling, all these things are just as important if not more so in science than in any other field, and that's the service academic publishers provide.

At the end of the day, nobody is holding a gun to the heads of professors forcing them to publish in these journals. These are freely-negotiated agreements between researchers and publishers. The reason researchers continue to support the existing system is that it serves their purposes and furthers their careers. Any alternative system will have to provide similar benefits.

> is a business like any other

If it's a business, then it's a business full of tyrannic monopolies. Science publishers are essentially leeching public money, because government funding goes only to people who publish through those publishers. That is a perpetual loop that is impossible to break, since scientists are incentivized (with public money) to publish through established journals. That is not 'clean' business like any other.

OTOH, open access doesn't mean that one has to do away with all the established journals, just that they have to make their content open to everyone. There are other ways to cover their operating costs, like charging for publishing (a better model, in which scientists have an incentive to choose the journal that offers the best value/money).

None of this explains why they should get the copyright.

They negotiate with researchers for it.

Just like I negotiate with my hospital for medical care payment.

Life lesson for US HNers who may never have been in this situation: you can negotiate with healthcare providers on price. It is actually fairly routine. They won't think less of you.

70 cents on the dollar delivered immediately in cash beats the heck out of 100 cents on the dollar transferred to a debt-collection agency, worked for 6 months, and negotiated down to 60 cents of which the agency keeps 15.

> You self publish, and most people regard you as a kook. You publish in Nature, and suddenly you're someone worth reading.

Nature (and others) are (in theory) peer-reviewed journals, which lends a sense of credibility that self publishing doesn't have.

Edited to be more specific - it's only the publishers who are systematically restricting knowledge that are at fault. But no matter where they've come from, what they are doing now is holding humanity back on a massive scale.

It's a narrative that plays very well on HN, that <insert industry here> are not only obsolete but evil and what's more stupid and we - the enlightened software developers - will free humanity from their shackles. But the world is a good deal more complicated than that. And not least because what most software people here do wouldn't really be missed if it went away. Ask Altavista and Myspace about that.

Of course the world is more complicated than that - but just saying it's more complicated is not a constructive contribution.

We can talk about problems we perceive and try to solve them, or we can sit around saying things are complicated. Let's do the first one.

That ground has been covered many times. Hotels it turns out do things that AirBnB doesn't, for good reasons. Taxis it turns out do things that Uber doesn't. Yadda yadda. My critique is intended of the theme, not any specific instance of it.

You probably consider yourself neither stupid nor evil - try to remember this when it's your turn to be "disrupted"...

Well, then I agree in abstract. I'm not suggesting we disrupt publishing, or try to save it. We (loosely the open science movement) are trying to fix our own field. Publishers serve a purpose, but some of what they do is harming us. We're encouraging the ones that can be encouraged to change. We'll force the ones that won't be encouraged.

You're right - I don't consider myself stupid or evil. But if I ever do become those things and it hurts others, whether I realise it or not, I hope someone forces me to stop.

> Ask Altavista and Myspace about that.

One particular search engine or one particular social network wouldn't be missed if gone, but I bet search engines and social networks would be missed if they all went away completely (and it would also be pretty bad if the Internet shut down entirely).

And like traditional publishers, there is still a role they can play: editing and synchronisation of anonymous review remain an important element of paper publication.

They theoretically could play this role, but others are much better at it. These companies are structured to exploit scientists and the public, not to enable them. Innovation is stifled both within and without. We're much better off turning to fresh approaches to peer review, for example any of the things on this list:


This is not done by the publishers. They only provide an online system to do so. The editors and reviewers are other scientists.

> This is not done by the publishers.

The work itself obviously is not[0], but you need somebody to synchronise that stuff, especially to ensure reviews remain anonymous (to paper authors). This must be handled by a third party.

[0] Not entirely true for editing: publishers generally employ a number of editors, although these editors usually have a solid science background for obvious reasons.

Aw come on now, they do perform a service that's of nonzero value - they (ostensibly) sift the wheat from the chaff. I don't have the expertise to know what's valid research and what's hokum, but I know if I read it in Nature, I can at least take it seriously.

Admittedly I believe we could develop a much better system for doing this than academic publishing, but I don't honestly know what such a system would look like.

Uh, you do know that the editing and reviewing are normally done for these journals by mostly government-supported academics who don't charge for their time, don't you? The academic journal business has devolved into taxpayer dollar farming with the usual campaign $$$ payoffs to establish and maintain monopolies.

The fraud detection aspect is truly intriguing. The motivation for the fraud could well be seeing other people's fraudulent papers and comparing yourself to their impossible data, and making the change in a desperate move for acceptance.

Academia as a whole has many diseases, but the emergence of the scientific academic world as a unified hive mind, which thanks to antics like this gets increasingly out of touch with reality, is seriously detrimental to progress.

>Peter's group can analyse thousands or hundreds of thousands of papers an hour, automatically detecting errors and fraud and simultaneously making the data, which are facts and therefore not copyrightable, free. //

Facts aren't copyrightable but their presentation is. If they're copying the presentation of those facts into a computer for commercial purposes (free release of the content of those papers is certainly commercial) then in the UK they'd be committing copyright infringement. There is also a possibility of infringing on EU database rights.

The OP link says:

>"Note that chemical structure diagrams are NOT creative works."

Whilst a plot or diagram may appear at first to lack "art" the presentation is formatted and certainly a "sweat of the brow" analysis supports it being copyrighted.

You can't scan someone else's molecular diagrams but you could look at them and enter the data yourself. Just like a chart, you can't duplicate the chart except by looking at it and copying the information - the presentation of which has been filtered away. Duplication in a computer produces an exact copy and AFAICT under UK legislation that would be infringing [silly as it is as the end result is the same it just takes more work].

Relying on only duplicating the image transiently into non-persistent memory might satisfy the requirements not to make a copy. That would be an interesting test-case. How the law has been construed to allow the use of search engines would probably help here.

>take an ancient paper //

I'm curious what license you'd acquire the papers under that allows their duplication in this manner. Even ancient papers are often controlled by the library they're stored in or by those that performed their digitisation.

[FWIW I strongly disapprove of the strength and duration of currently granted copyright]

I agree with all your analysis. In this case the presentation is discarded and, as I see it, only the facts are retained.

For example in parsing chemical structure diagrams what is recorded are things like the number of atoms in a section of a molecule, their angles and what kind of bonds exist to neighbouring molecules. These data are then analysed to generate the formula and re-construct a correct diagram.
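To make that concrete, here is a minimal sketch of the last step (going from recorded atoms and bonds to a formula, plus a simple sanity check). It is not the group's actual software: the function names, the `TYPICAL_VALENCE` table and the data layout are all assumptions made up for illustration, and real chemistry needs far more than this (charges, aromaticity, implicit hydrogens).

```python
from collections import Counter

# Typical valences for a few common elements.
# Illustrative assumption only, not a complete chemistry model.
TYPICAL_VALENCE = {"H": 1, "C": 4, "N": 3, "O": 2}


def hill_formula(atoms):
    """Return the molecular formula in Hill order: C, then H, then the rest A-Z."""
    counts = Counter(atoms)
    if "C" in counts:
        rest = sorted(e for e in counts if e not in ("C", "H"))
        order = ["C"] + (["H"] if "H" in counts else []) + rest
    else:
        order = sorted(counts)
    return "".join(e + (str(counts[e]) if counts[e] > 1 else "") for e in order)


def valence_errors(atoms, bonds):
    """Return indices of atoms whose summed bond order differs from the typical valence."""
    totals = [0] * len(atoms)
    for i, j, order in bonds:
        totals[i] += order
        totals[j] += order
    return [
        i
        for i, element in enumerate(atoms)
        if element in TYPICAL_VALENCE and totals[i] != TYPICAL_VALENCE[element]
    ]


# Methanol, CH3OH: atom 0 is C, atom 1 is O, atoms 2-5 are H.
# Each bond is (atom index, atom index, bond order).
atoms = ["C", "O", "H", "H", "H", "H"]
bonds = [(0, 1, 1), (0, 2, 1), (0, 3, 1), (0, 4, 1), (1, 5, 1)]

print(hill_formula(atoms))           # formula built only from the recorded atoms
print(valence_errors(atoms, bonds))  # atoms whose bonding looks wrong, if any
```

The point of the sketch is that only counts and connectivity are kept; nothing about how the original diagram was drawn survives into the data.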

There are no licenses to my knowledge on older papers that allow publishing the analysis (but within the University we have agreements with JSTOR, for example). Looking at public domain stuff is OK. This is part of the issue - that knowledge should be a public good.

If you're a lawyer I'm sure Peter would appreciate hearing your opinion.

I would think that the conversion from a rasterized scan to an "open markup language" would be sufficient to count as a new presentation, no?

>conversion from a rasterized scan to an "open markup language" //

[Skip to the end!]


The problem is that a rasterised scan is a new - potentially unauthorised - copy. UK law tends to be more restrictive as we don't have the same sense of "Fair Use" as 17 USC.

Art 5.1 of the EU Copyright Directive (2001/29/EC; Section 28A of the UK CDP Act) at Section 1(b) allows for transient copies to be made when the copying is part of an otherwise allowed act. But the stipulation is that the copy can't have "economic significance".

Here the rasterisation then would appear to fail, even if it can be considered a transient part of the transfer of the information from copyright diagram to free-libre ML.

This to me - as a non-expert [though I consider myself pretty well read on copyright] - means that the conversion needs to be made without making an intermediate copy. A manual process would bypass the problem of making a copy for a computer program to analyse at the expense of lots of human input.

This is where things get silly as the end result is the same - the extraction of information from a catalogue of molecular diagrams - the process is just made more expensive. I rather hope my analysis is wrong actually and that a court would rule that scanning such works would be allowable in order to extract the informational content; would love to have more input here.


... actually further looking at S.28A(b) makes me think I am wrong; that this should be allowed. I'm convinced that the copies made aren't independently commercially significant and that as the process of extracting the information from the diagrams is an allowed use then the "transient copy" legislation makes this allowable.

IA[of course]NAL.

Crucially, here the rasterised scan is made by the publishers, and whoever runs the analysis software is allowed to access the digital image.

I'm not sure that is crucial - I'll bet there are terms associated with the allowance of access to the already rasterised images along the lines of "solely for individual reading".

However I'll leave it there as it's too complex an issue to address generalities rather than the specific nature of the inputs, processing and analysis and intended uses, commercial aspects and such.

I pray every blessing on your knowledge sharing endeavours.

When I asked my journalist friend why in football (soccer) games the refs don't use high-tech, he thought about it for 5 minutes and then told me: "If they use technology it will be really hard to set up games. If you take from a league the ability to set-up games and promote specific teams/individuals, then I don't know how the game will be shaped".

Of course it's universal, it's not like everything is a set-up but happens more often than most would likely imagine, especially since betting came into play.

So there you got it.

The reasoning for not using hi-tech refereeing equipment in football is apparently a desire to keep the game played at the top level the same as that played in streets and fields the world over.

The friction of imperfect decisions is also considered part of the drama of the game, rather than a flaw.

> The reasoning for not using hi-tech refereeing equipment in football is apparently a desire to keep the game played at the top level the same as that played in streets and fields the world over.

That's already gone with goal line technology. In fact, it was never the case - how many street games/underage kids games even have assistant refs?

> The friction of imperfect decisions is also considered part of the drama of the game, rather than a flaw.

That very much depends on who you ask. Personally, I'd like to see some technology come in (in high profile games played in stadiums already equipped with tv cameras) to help get rid of the more ridiculous refereeing errors.

The most comical example of contrived, convoluted FIFA rules I can think of is the Zidane red card in the World Cup final 2006[1]. Allegedly an assistant ref saw Zidane headbutt another player on a tv monitor, then alerted the ref, who red-carded Zidane. Technically the ref didn't follow the rules, as to send someone off the ref or assistants have to have seen the incident in real time. This means the average football fan with only basic knowledge of the rules has a better view of controversial, game changing decisions in high profile games than the ref (fans can watch instant replays in slo mo whereas the ref has to watch at full speed, sometimes with an obstructed view of an incident.)

[1] http://www.football-italia.net/42269/ref-who-saw-zidane-hit-...

Why would a high-tech system not be able to make biased calls (if that is desired by whoever is running it)? Machines may not be susceptible to bribes, but the people running/programming/calibrating them certainly are.

Presumably the machine flags WHY it's an error, and as long as that reasoning is included in the rebuttal, humans, or other machines, are free to disagree with it. Science.

Are you implying that authors and journals are conspiring to push through to publication papers with obvious flaws for some sort of monetary gain?

...or did you not read the article?

Actually they are experimenting with goal line technology at the world cup, e.g. did that ball go over the line or not.

There is a lot of dodgy stuff going on with football betting in the Far East, which makes your comment "If you take from a league the ability to set-up games" a little worrying.

This should be supported (both financially and ideologically) by the National Library of Medicine at the National Institutes of Health. The NIH doles out about $30 billion in research grants every year. If they could spend a tiny fraction of a percent to dramatically improve the quality of the rest and make such automatic checking a standard practice that would be tremendous bang for the buck.

Oh yeah -- and they're big enough to fight academic publishers.

Can they release the software to the world? Maybe, if we all make an effort to analyse whatever papers we can access, we will together make enough noise that it will be impossible to ignore, and also impossible to silence (cf. The Pirate Bay). This could be one of the most important advancements of science in the past few years.

This is great work, another fantasy of mine made reality and posted to HN!

Is there a tutorial for getting started with OSCAR? A "HOWTO" for analyzing a paper would make this program more accessible. If I could learn how to use it without spending too much time doing so, I would use it as a tool for reviewing manuscripts. I would also like to use it on my own manuscripts to find mistakes.

I have a feature request: optical pattern recognition of mathematical formulas. It would be awesome to feed a program a pdf and have all of the mathematical formulas translated to LaTeX.

At first I thought the article would be about sports, which in itself would make for an interesting discussion about using machines to judge rules adherence, not that I would want to take that human element out of sports.

However this is more along the lines of validating what is published. Of any group you would hope that scientists and their like would jump on technology like this so as to provide the most accurate representation of their work as possible. The same for publishers, why wouldn't they want to brag that they use the most advanced interrogation methods for the papers they publish?

I guess they are people too, hyper sensitive that fault will be found

> I guess they are people too, hyper sensitive that fault will be found

Or straight up fraud. Or a dismantling of their monopoly of information/data.

So as a non-scientist, let me see if I understand.

There are lots of uncaught errors floating around out there in scientific papers, and many of them can now be found with this software. But exposing the errors so that they can be corrected is tricky because: A) you have to have legal access to a paper in order to scan it, and B) even if you do have access, under the current rules only the publishers have the right to expose the errors, and they're not interested because they want to avoid the embarrassment.

Am I understanding it?

I see a very exciting possibility for the future of academic papers in certain disciplines where we could have a machine validation step performed automatically, not only on submission but as a tool for the author to check their work. Like a git commit hook that forces a test suite to run. Of course, this would require some formalism to tag data, diagrams, and formulae but it's probably in our best interest in the long run to make the body of our research more machine-accessible anyway.
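As a rough sketch of what such a hook-style validation step could look like: a list of checks is run over the manuscript text, and a nonzero exit code rejects the submission, the same way a failing test suite blocks a commit. Everything here is hypothetical (the check functions, the `Figure N:` caption convention and the `(see Figure N)` reference style are assumptions for illustration, not anything the article's software does).

```python
import re


def check_figure_refs(text):
    """Flag references to figures that have no matching caption line."""
    captions = set(re.findall(r"^Figure (\d+):", text, flags=re.MULTILINE))
    referenced = set(re.findall(r"\(see Figure (\d+)\)", text))
    missing = sorted(referenced - captions, key=int)
    return [f"Figure {n} is referenced but has no caption" for n in missing]


def check_percentages(text):
    """Flag percentages above 100, which are usually a typo in a reported yield."""
    values = re.findall(r"(\d+(?:\.\d+)?)%", text)
    return [f"implausible percentage: {v}%" for v in values if float(v) > 100]


# The "test suite": add more checks here as the tagged formalism grows.
CHECKS = [check_figure_refs, check_percentages]


def validate(text):
    """Run every check and return (exit_code, problems); nonzero means reject."""
    problems = [msg for check in CHECKS for msg in check(text)]
    return (1 if problems else 0), problems


manuscript = """The yield was 104% (see Figure 2).
Figure 1: Mass spectrum of the product.
"""

code, problems = validate(manuscript)
for problem in problems:
    print(problem)
```

Authors could run the same `validate` function locally before submitting, which is the "tool for the author" half of the idea.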

I have a hard time seeing how anything but a teeny tiny fraction of scientific results would ever be amenable to such an automatic checking. And I am a mathematician -- in principle this should be easiest in mathematics, since at least we have well defined axioms and in principle one could derive everything from those axioms. In practice, this seems completely unfeasible for most mathematics, at least currently.

I'm guessing you're not alone, I would go so far as to say that validation software would offend most authors - the same way code validation tools hurt programmers' feelings.

"But I'm right and the tool is wrong! The tool doesn't understand the complexity/brilliance of my work!" And sometimes you're going to be right with that assertion. Other times, however, it will push you and your reviewers towards better quality.

I find it fascinating that the entire article is in fact about this issue, the copyright thing is purely incidental.

I think that's built on a massive assumption about the nature of academic papers tbh. I'm a political scientist doing theoretical work on assemblages of governance and policy - part of my research is trying to come up with an explanation as to how this stuff works, it doesn't boil down to formulae or (many) metrics, although I'm doing methodological work to address that. Accessible, sure, just don't solve a problem that doesn't apply to ~75% of academia!

I knew this would come up, that's why I wrote

> [...] in certain disciplines [...]

which is also exactly the use case the article refers to: for example chemistry papers.

My bad, I was skim reading in between library sessions, ironically. Could have done with some sort of automated check there on my post, if only someone had suggested something like that... rolls eyes

For those curious, the 5 membered ring in cyclopiazonic acid should have an NH group rather than a CH2.

When people talk about the future, they always seem to think that it will be the scientific jobs that get roboticized last. I think it will be the opposite, it won't be long before systems like this one will be able to analyze the scientific literature, identify shortcomings, and tell us what experiments to do next. Science will become less about creativity and problem solving, and more about following directions; eventually becoming completely automated.


Any chance you could farm out the software to a lab in a country with MUCH MUCH looser copyright laws, and a court system that would be problematic for outside law suits?

That's what I was thinking. Find someone who isn't under such a restrictive licence, and let them feed in the data.

I presume this is what will happen.

I suppose one way around this would be for the NSF to require any grant awardees to deposit their structures in a publicly accessible database...But, I'm a bit surprised--is there nothing like arxiv.org for chemistry? Why not?

There is of course a way around the problems cited in the article.

If the referees ran the software on the preprint it would find the same problem.

I agree this isn't as good, but it would be a step forward.

I think the dream would be to couple a literature-analyzer like this with a specialized search engine like Wolfram Alpha.
