They can take an ancient paper with very low quality diagrams of complex chemical structures, parse the image into an open markup language and reconstruct the chemical formula and the correct image. Chemical symbols are just one of many plugins for their core software which interprets unstructured, information rich data like raster diagrams. They also have plugins for phylogenetic trees, plots, species names, gene names and reagents. You can develop plugins easily for whatever you want, and they're recruiting open source contributors (see https://solvers.io/projects/QADhJNcCkcKXfiCQ6, https://solvers.io/projects/4K3cvLEoHQqhhzBan).
As a side effect of how their software works, it can detect tiny suggestive imperfections in images that reveal scientific fraud. I was shown a demo where a trace from a mass spec (like this http://en.wikipedia.org/wiki/File:ObwiedniaPeptydu.gif) was analysed. As well as reading the data from the plot, it revealed a peak that had been covered up with a square - the author had deliberately obscured a peak in their data that was inconvenient. Scientific fraud. It's terrifying that they find this in most chemistry papers they analyse.
Peter's group can analyse thousands or hundreds of thousands of papers an hour, automatically detecting errors and fraud and simultaneously making the data, which are facts and therefore not copyrightable, free. This is one of the best things that has happened to science in many years, except that publishers deliberately prevent it. Their work also made me realise it would be possible to continue Aaron Swartz' work on a much bigger scale (http://blahah.net/2014/02/11/knowledge-sets-us-free/).
Academic publishers who are suppressing this are literally the enemies of humanity.
I would not be so extreme: they used to be a necessary evil before the Internet became mainstream. They are more of a millstone from the past that research still drags with it because changing an institution is hard.
It's not that changing is hard, since it would be easy if that's what everyone wanted. It's that you have many competing forces (perhaps that's what you mean by "hard"). The most important is scientists who made their careers in the current system--the people who publish yearly in Nature, for example. They have tremendous political sway and it is not in their interest to change. They will even admit that it is not an ideal system, but no one likes giving up being King.
Edit: I should point out that it is more complicated. Science does need a way to judge which scientists are doing valuable work and which aren't, and Impact Factors are currently one of the main ways this is done. We don't yet have a great alternative.
Every scientist I've ever met wants as many people as possible to read (and cite) their work. But the core issue is that the main metric for scientists' performance is the number of publications in prestigious journals, and the oldest journals tend to be the most prestigious ones. If a scientist wants to get grant funding or promotions, they have to put up with the copyright terms of the traditional publishers.
Any real change to the system has to start from the government agencies and councils handing out grant funding to research groups. As long as they prefer to fund people with Nature papers over SomeOpenAccessJournal papers, only a few scientists are willing to sacrifice their career for the principle of open access.
At the end of the day, nobody is holding a gun to the heads of professors forcing them to publish in these journals. These are freely-negotiated agreements between researchers and publishers. The reason researchers continue to support the existing system is that it serves their purposes and furthers their careers. Any alternative system will have to provide similar benefits.
If it's a business, then it's a business full of tyrannical monopolies. Science publishers are essentially leeching public money, because government funding goes only to people who publish through those publishers. That is a perpetual loop that is impossible to break, since scientists are incentivized (with public money) to publish through established journals. That is not 'clean' business like any other.
OTOH, open access doesn't mean that one has to do away with all the established journals, just that they have to make their content open to everyone. There are other ways to cover their operating costs, like charging for publishing (a better model, in which scientists have an incentive to choose the journal that offers the best value/money).
70 cents on the dollar delivered immediately in cash beats the heck out of 100 cents on the dollar transferred to a debt-collection agency, worked for 6 months, and negotiated down to 60 cents of which the agency keeps 15.
Nature (and others) are (in theory) peer-reviewed journals, which lends a sense of credibility that self-publishing doesn't have.
We can talk about problems we perceive and try to solve them, or we can sit around saying things are complicated. Let's do the first one.
You probably consider yourself neither stupid nor evil - try to remember this when it's your turn to be "disrupted"...
You're right - I don't consider myself stupid or evil. But if I ever do become those things and it hurts others, whether I realise it or not, I hope someone forces me to stop.
One particular search engine or one particular social network wouldn't be missed if gone, but I bet search engines and social networks would be missed if they all went away completely (and it would also be pretty bad if the Internet shut down entirely).
The work itself obviously is not, but you need somebody to synchronise that stuff, especially to ensure reviews remain anonymous (to paper authors). This must be handled by a third party.
Not entirely true for editing: publishers generally employ a number of editors, although these editors usually have a solid science background, for obvious reasons.
Admittedly I believe we could develop a much better system for doing this than academic publishing, but I don't honestly know what such a system would look like.
Academia as a whole has many diseases, but the emergence of the scientific academic world as a unified hive mind, which thanks to antics like this gets increasingly out of touch with reality, is seriously detrimental to progress.
Facts aren't copyrightable but their presentation is. If they're copying the presentation of those facts into a computer for commercial purposes (free release of the content of those papers is certainly commercial) then in the UK they'd be committing copyright infringement. There is also a possibility of infringing on EU database rights.
The OP link says:
>"Note that chemical structure diagrams are NOT creative works."
Whilst a plot or diagram may appear at first to lack "art" the presentation is formatted and certainly a "sweat of the brow" analysis supports it being copyrighted.
You can't scan someone else's molecular diagrams but you could look at them and enter the data yourself. Just like a chart, you can't duplicate the chart except by looking at it and copying the information - the presentation of which has been filtered away. Duplication in a computer produces an exact copy and AFAICT under UK legislation that would be infringing [silly as it is, since the end result is the same; it just takes more work].
Relying on only duplicating the image transiently into non-persistent memory might satisfy the requirement not to make a copy. That would be an interesting test case. How the law has been construed to allow the use of search engines would probably help here.
>take an ancient paper //
I'm curious what license you'd acquire the papers under that allows their duplication in this manner. Even ancient papers are often controlled by the library they're stored in or by those that performed their digitisation.
[FWIW I strongly disapprove of the strength and duration of currently granted copyright]
For example in parsing chemical structure diagrams what is recorded are things like the number of atoms in a section of a molecule, their angles and what kind of bonds exist to neighbouring molecules. These data are then analysed to generate the formula and re-construct a correct diagram.
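To make that last step concrete, here is a rough Python sketch of going from recorded atoms and bonds to a formula. It is only an illustration under my own assumptions, not how Peter's software actually does it: the `molecular_formula` function, the `VALENCE` table and the input layout are all hypothetical, and it ignores charges, aromaticity and variable valence.

```python
from collections import Counter

# Typical valences for a few common elements. This table is an
# illustrative assumption; real molecules need charges, aromaticity, etc.
VALENCE = {"C": 4, "N": 3, "O": 2}


def molecular_formula(atoms, bonds):
    """Build a Hill-order formula from heavy atoms and their bonds.

    atoms: list of element symbols, e.g. ["C", "C", "O"]
    bonds: list of (i, j, order) tuples indexing into atoms
    """
    # Sum the bond orders attached to each atom.
    used = [0] * len(atoms)
    for i, j, order in bonds:
        used[i] += order
        used[j] += order

    counts = Counter(atoms)
    # Fill whatever valence is left over with implicit hydrogens.
    for symbol, bonded in zip(atoms, used):
        counts["H"] += max(VALENCE[symbol] - bonded, 0)

    # Hill order: carbon, then hydrogen, then the rest alphabetically.
    # Without carbon, everything is simply alphabetical.
    present = [el for el, n in counts.items() if n > 0]
    if "C" in present:
        rest = sorted(el for el in present if el not in ("C", "H"))
        ordered = ["C"] + (["H"] if "H" in present else []) + rest
    else:
        ordered = sorted(present)

    return "".join(
        el + (str(counts[el]) if counts[el] > 1 else "") for el in ordered
    )


# Ethanol skeleton: C-C-O with single bonds.
print(molecular_formula(["C", "C", "O"], [(0, 1, 1), (1, 2, 1)]))
```

The point is just that once the diagram has been reduced to this kind of atom and bond data, the formula falls out of simple counting, with none of the original presentation left in it.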
There are no licenses to my knowledge on older papers that allow publishing the analysis (but within the University we have agreements with JSTOR, for example). Looking at public domain stuff is OK. This is part of the issue - that knowledge should be a public good.
If you're a lawyer I'm sure Peter would appreciate hearing your opinion.
[Skip to the end!]
The problem is that a rasterised scan is a new - potentially unauthorised - copy. UK law tends to be more restrictive as we don't have the same sense of "Fair Use" as 17 USC.
Art 5.1 of the EU Copyright Directive (2001/29/EC; Section 28A of the UK CDP Act) at Section 1(b) allows for transient copies to be made when the copying is part of an otherwise allowed act. But the stipulation is that the copy can't have "economic significance".
Here the rasterisation then would appear to fail, even if it can be considered a transient part of the transfer of the information from copyright diagram to free-libre ML.
This to me - as a non-expert [though I consider myself pretty well read on copyright] - means that the conversion needs to be made without making an intermediate copy. A manual process would bypass the problem of making a copy for a computer program to analyse at the expense of lots of human input.
This is where things get silly as the end result is the same - the extraction of information from a catalogue of molecular diagrams - the process is just made more expensive. I rather hope my analysis is wrong actually and that a court would rule that scanning such works would be allowable in order to extract the informational content; would love to have more input here.
... actually further looking at S.28A(b) makes me think I am wrong; that this should be allowed. I'm convinced that the copies made aren't independently commercially significant and that as the process of extracting the information from the diagrams is an allowed use then the "transient copy" legislation makes this allowable.
However I'll leave it there, as it's too complex an issue to address in generalities rather than with reference to the specific nature of the inputs, the processing and analysis, the intended uses, the commercial aspects and such.
I pray every blessing on your knowledge sharing endeavours.