
A plan to mine the world’s research papers - qaute
https://www.nature.com/articles/d41586-019-02142-1
======
dlkf
> No one will be allowed to read or download work from the repository, because
> that would breach publishers’ copyright. Instead, Malamud envisages,
> researchers could crawl over its text and data with computer software,
> scanning through the world’s scientific literature to pull out insights
> without actually reading the text.

I find this totally unconvincing. The average scientific article isn't any
good, and the NLP algorithms that do tasks like this are even worse.

For scientific literature to be useful to those of us outside the academy, we
need to be able to see the full document - what methodology was employed, what
assumptions were made, how the data was gathered - just to be able to gauge
whether the authors had any idea what they were doing. Ideally we would also
be able to search over documents, explore the citation tree in some sort of
UI, and access articles in multiple formats (sometimes you want a PDF,
sometimes you want plaintext, and I imagine that the mathy-types might like
LaTeX source).
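
To make the citation-tree wish concrete, here is a minimal sketch of what
walking one level of a paper's references could look like, using the public
Crossref API. The endpoint is real, but the JSON field names are from memory
and the example DOI is just the article above, so treat this as an
illustration rather than a tested client:

    # Sketch: walk one level of a paper's citation tree via the Crossref API.
    # Field names ("message", "reference", "DOI") are from memory and may need
    # checking; this is illustrative, not production code.
    import requests

    def references_of(doi):
        """Return the DOIs referenced by the work identified by `doi`."""
        resp = requests.get(f"https://api.crossref.org/works/{doi}", timeout=30)
        resp.raise_for_status()
        refs = resp.json()["message"].get("reference", [])
        return [r["DOI"] for r in refs if "DOI" in r]

    def citation_tree(doi, depth=1):
        """Recursively collect cited DOIs down to `depth` levels."""
        if depth == 0:
            return {doi: []}
        return {doi: [citation_tree(d, depth - 1) for d in references_of(doi)]}

    # Example: one level of references for the Nature article linked above.
    print(citation_tree("10.1038/d41586-019-02142-1", depth=1))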

I applaud Malamud's efforts to overcome this problem, and I hope his results
prove me wrong. I just think it's sad that we have to resort to hacks like
this to overcome what is obviously an enormous scam that is stealing our tax
dollars and stifling academic and economic creativity.

~~~
montecarl
> The average scientific article isn't any good, and the NLP algorithms that
> do tasks like this are even worse.

In my day job, I'm often tasked with implementing algorithms from recently
published physics papers. In order to do so, I normally have to read through
at least 10 related papers (both cited papers and cited-by papers) in order to
have a clear idea in my head about what is going on. Even then, I often have
to discuss what I have read with 2 or 3 other people and try out many
different approaches before we finally understand what the paper meant.

This is because few papers include enough details to reproduce what they are
doing. Many of these papers have mistakes or typos in their equations. Many of
the math equations are also underspecified. For example, an equation may list
a sum over an index, but then in the text there may be a whole paragraph that
describes what that index means (there is nothing wrong with this, but it
would make it hard for AI to "just use the equations").
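
To make the "underspecified index" point concrete, here is a made-up example
of the kind of equation described above (not taken from any particular paper):

    % The paper might print only
    %   \hat{y} = \sum_i w_i \, \phi_i(x)
    % and explain solely in the running text that i ranges over detector
    % channels passing a quality cut and that \phi_i is a response function
    % defined two sections earlier. Nothing in the formula itself says so.
    \[
      \hat{y} \;=\; \sum_{i} w_i \, \phi_i(x)
    \]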

At the end of this whole process, which can take a week, about half of the
time we choose not to implement the algorithm for one of several reasons:

* The authors misrepresented their work and it does not perform as well as claimed (e.g. the chosen examples are special cases that make their approach look better than the state-of-the-art approach)

* The work is not reproducible from the information in the paper

* The amount of work to implement it is far greater than an initial reading of the paper would suggest, due to additional details that were left out of their discussion

So, all of this is to say that understanding scientific papers is at the very
limit of human ability, even for a group of PhDs in the field. Until we have
much more powerful AI, I do not think it has any hope of making sense of this
mess.

edit: P.S. I am guilty of these same mistakes when publishing. I understand
the deadlines and pressure to publish that lead to these issues. It is a huge
amount of work to fully document and publish all the details needed to
reproduce a new algorithm.

~~~
fromthestart
> In my day job, I'm often tasked with implementing algorithms from recently
> published physics papers. In order to do so, I normally have to read through
> at least 10 related papers (both cited papers and cited-by papers) in order to
> have a clear idea in my head about what is going on. Even then, I often have
> to discuss what I have read with 2 or 3 other people and try out many
> different approaches before we finally understand what the paper meant.

This post makes me feel slightly better about how often I come away from
reading a paper feeling like I only have a shallow understanding of the
content.

------
kodz4
> A trigger for this mission came from a landmark Delhi High Court judgment in
> 2016. The case revolved around Rameshwari Photocopy Services, a shop on the
> campus of the University of Delhi. For years, the business had been
> preparing course packs for students by photocopying pages from expensive
> textbooks. With prices ranging between 500 and 19,000 rupees (US$7–277),
> these textbooks were out of reach for many students. In 2012, Oxford
> University Press, Cambridge University Press and Taylor and Francis filed a
> lawsuit against the university, demanding that it buy a license to reproduce
> a portion of each text. But the Delhi High Court dismissed the suit. In its
> judgment, the court cited section 52 of India’s 1957 Copyright Act, which
> allows the reproduction of copyrighted works for education. Another
> provision in the same section allows reproduction for research purposes.

Good job India.

------
nafizh
It would be interesting to see the server get hacked, all the papers end up
in the open, and the researchers then send a sorry email to the publishers.

~~~
killjoywashere
And use the same legal theory as Equifax, Target, etc? It would be beautiful.

------
apo
> Over the past year, Malamud has — without asking publishers — teamed up with
> Indian researchers to build a gigantic store of text and images extracted
> from 73 million journal articles dating from 1847 up to the present day.

Maybe I missed it, but the article doesn't seem to explain exactly how
Malamud's group compiled its database.

Throttling is a major problem with the naive approach of pointing wget at a
publisher's site. The publisher detects a bot on its network downloading
everything in sight and either slows the transfer to a trickle or shuts off
access entirely.
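
For what it's worth, even a "polite" bulk fetcher ends up looking something
like the sketch below; the delay and backoff numbers are made up, and the
point is only that anything that does not pace itself like this tends to trip
exactly the defenses described above:

    # Sketch of a rate-limited fetcher. Purely illustrative: the pacing and
    # backoff values are arbitrary, and no real publisher endpoint is implied.
    import time
    import requests

    def fetch_all(urls, delay_seconds=5.0):
        results = {}
        for url in urls:
            resp = requests.get(url, timeout=60)
            if resp.status_code == 429:      # server asks us to slow down
                time.sleep(60)               # crude backoff, then retry once
                resp = requests.get(url, timeout=60)
            results[url] = resp.content if resp.ok else None
            time.sleep(delay_seconds)        # pause between requests
        return results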

The publishers may not win on copyright, but they may try to make a case based
on criminality if Malamud's team actively took steps to circumvent throttling
and defeat the defenses of the hosting sites. Especially if the publisher
knows what to look for in its logs.

~~~
toomuchtodo
Scihub? Bonus points if your properly executed scraping project backfills
Scihub where it is missing DOIs.

------
jdjayded
I'm a little late to this party, but here's the obligatory research plug:

My lab works in the area of evidence based medicine. My research focuses on
Randomized Controlled Trials, and the overall goal is to automate (fully or
partially) the meta-analysis of medical interventions. To that end, we
collected a dataset of intervention pairs, statistical significance findings
about these pairs, and a minimal rationale supporting the significance
finding.
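
To give a feel for what that means, one record in a dataset like this is
roughly the kind of tuple sketched below. The field names and values are my
own illustration of the description above, not the repository's actual
schema; the links that follow have the real format:

    # Hypothetical illustration of one annotation: an intervention/comparator
    # pair, the reported significance finding for an outcome, and a short
    # rationale span. Field names and values are invented for illustration.
    example_annotation = {
        "article_id": "PMC0000000",              # placeholder identifier
        "intervention": "drug A",
        "comparator": "placebo",
        "outcome": "mortality at 30 days",
        "finding": "no significant difference",  # vs. significantly increased/decreased
        "rationale": "There was no difference in 30-day mortality between groups.",
    }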

Since these annotations were collected on full text PubMed Open Access
articles, we can distribute both the articles and the annotations:
[https://github.com/jayded/evidence-inference](https://github.com/jayded/evidence-inference) ; paper:
[https://arxiv.org/abs/1904.01606](https://arxiv.org/abs/1904.01606) ;
website: [http://evidence-inference.ebm-nlp.com/](http://evidence-inference.ebm-nlp.com/)

We're working on collecting more complete annotations. We hope to facilitate
meta-analysis, evidence-based medicine, and long-document natural language
inference. We might even succeed (somewhat), since this is a very targeted
effort as opposed to something broader.

------
Merrill
This is good, more because it will make searching for information easier than
because it will avoid copyright. Searching through journal articles to find
information about a given topic is very hard, even at a university with more
or less universal library access to the online literature.

Authoring papers of uneven detail and quality and then publishing them in
whatever prestigious journal will have them is a terrible way to document
the progress of science.

Instead, new results should be added to an open science information base,
fully linked to all previous results that bear upon the new results, either in
support or contradiction. This is what some of the bioinformatics databases
attempt to achieve by scanning articles, but it would be better to omit the
article step.
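
One way to picture such an information base is as a graph of results with
typed links. The sketch below is entirely my own illustration of that idea,
not a description of any existing system:

    # Sketch of a linked result record: each new result points at the prior
    # results it supports or contradicts. Purely illustrative.
    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Result:
        claim: str                # the finding, stated plainly
        evidence_doi: str         # where the underlying paper or data lives
        supports: List[str] = field(default_factory=list)     # IDs of results it backs up
        contradicts: List[str] = field(default_factory=list)  # IDs of results it disputes

    new_result = Result(
        claim="Compound X reduces enzyme Y activity in vitro",
        evidence_doi="10.0000/example",
        supports=["10.0000/earlier-assay"],
        contradicts=["10.0000/conflicting-report"],
    )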

------
aphextim
R.I.P. Aaron Swartz
[https://en.wikipedia.org/wiki/Aaron_Swartz](https://en.wikipedia.org/wiki/Aaron_Swartz)

------
modeless
So, sci-hub, but less accessible?

~~~
hyperbovine
Also, marginally less likely to suddenly vanish in a poof of lawsuit one day.
See Napster.

~~~
jacquesm
But storage costs have dropped by orders of magnitude since then and 'personal
copies' of SciHub are a definite possibility.

~~~
IronBacon
I read somewhere a couple of years ago that SciHub was about 35TB of data,
quite big but still manageable.

~~~
jacquesm
It's a bit more now, going on 80T or so. Still quite manageable. A large NAS
will do.
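
Back-of-the-envelope, assuming round numbers of 80 TB total and 12 TB drives
(both figures are rough):

    # Rough check that "a large NAS will do" for ~80 TB.
    corpus_tb = 80
    drive_tb = 12
    data_drives = -(-corpus_tb // drive_tb)  # ceil(80/12) = 7
    total_drives = data_drives + 1           # plus one drive of parity
    print(total_drives)                      # 8, i.e. a single 8-bay NAS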

~~~
m1el
I just happen to have an 80T NAS that I plan to use as a sci-hub mirror. Is
there a way to download it? The libgen torrents are dead.

~~~
jacquesm
email?

