
Leveraging machine learning to fuel new discoveries with the ArXiv dataset - lelf
https://blogs.cornell.edu/arxiv/2020/08/05/leveraging-machine-learning-to-fuel-new-discoveries-with-the-arxiv-dataset/
======
techbio
This is a long time coming. Glad that they've done the lifting to expose these
resources to higher-order analysis.

I took a look at this potential a couple of years ago but on PubMed:

[https://techbio.org/b-tracing-psych-signals-lit.php](https://techbio.org/b-tracing-psych-signals-lit.php)

------
gabcoh
Is there any potential for something like this to be added to GPT-4's training
data set? Being far from an expert, I assume the figures and any non-text data
could pose a problem, but it also seems like providing it with a huge,
high-quality source of scientific reasoning could lead to some pretty amazing
and revolutionary results: GPT inspiring research directions or even
coauthoring. Does anyone who knows more about this have any thoughts? Have I
just drunk the GPT Kool-Aid, or is there some potential for it to revolutionize
science quite soon?

~~~
phreeza
My money is on the next iteration of GPT being multi-modal (see image-gpt). So
this kind of thing would fit right in. I don't think this would lead to the
kind of scientific revolution you are thinking of, given the tendency of GPT
to confabulate things instead of basing itself on facts. That may be entering
Kool-Aid territory :)

------
mellosouls
For anybody on mobile thinking the link has redirected to the index, you need
to scroll way down to read the actual article.

~~~
ajflores1604
Thank you

------
physicsgraph
The value of content tagging (e.g., PhysML) and keyword tagging (e.g.,
ScienceWise) is apparent in aggregate, like for searching. That benefit is to
consumers while the burden is currently on the content creators.

I don't know of any incentives within academia or grant processes that would
motivate content authors to tag their content, with the exception of the
creators of the tagging systems. That means bulk analysis (whether using a
grammar or a natural language machine learning method) is key.

Citation graphs (which are mostly what's been done previously [0]) pale in
comparison to text analysis. The possibility of enabling complex searches
would be a big leap forward in science.

[0]
[https://physicsderivationgraph.blogspot.com/2020/05/literatu...](https://physicsderivationgraph.blogspot.com/2020/05/literature-review-for-using-arxiv-as.html)
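
As a toy illustration of the kind of bulk text analysis meant here, the sketch below tags records by keyword from their abstracts. It assumes one JSON object per line with `id` and `abstract` fields (assumed names; check the dataset's actual layout), and the keyword table and the `tag_records` helper are made up for the example. Substring matching is only a stand-in for a real grammar or NLP method.

```python
import json

# Hypothetical tag -> trigger phrases; not taken from ScienceWise or PhysML.
KEYWORDS = {
    "dark matter": ["dark matter"],
    "superconductivity": ["superconduct"],
}


def tag_records(lines, keywords=KEYWORDS):
    """Yield (id, sorted tags) for each JSON line whose abstract matches."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        text = record.get("abstract", "").lower()
        tags = sorted(
            tag for tag, phrases in keywords.items()
            if any(phrase in text for phrase in phrases)
        )
        if tags:
            yield record.get("id"), tags


if __name__ == "__main__":
    # Made-up records, only to show the call shape.
    sample = [
        '{"id": "a1", "abstract": "We study dark matter halos."}',
        '{"id": "a2", "abstract": "A note on graph colouring."}',
    ]
    print(list(tag_records(sample)))
```

Because `tag_records` is a generator over lines, it can be pointed at an open file handle without loading the whole corpus into memory.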

------
physicsgraph
There have been efforts to tag keywords in arXiv [0] and to identify sections
of articles [1]. A conference on Mathematical Knowledge Management [2] was
held last week; some participants have already been analyzing arXiv.

Hopefully integration with Kaggle expands the number of teams taking advantage
of the knowledge in the corpus.

[0] [http://sciencewise.info/](http://sciencewise.info/)
[1] [https://github.com/OMdoc/OMDoc/wiki/PhysML](https://github.com/OMdoc/OMDoc/wiki/PhysML)
[2] [https://cicm-conference.org/2020/cicm.php](https://cicm-conference.org/2020/cicm.php)

------
iandanforth
This is really cool. I worked on the AI Index Report and had to bug karpathy
to get his copy of arxiv papers to do analysis. (He's been slowly collecting
papers for arxiv sanity for years). Getting them all from the arxiv API would
have taken months.

This will enable tons of useful stats gathering about the fields represented
on arxiv. Hopefully it will lead to new scientific insights as well!

~~~
newman8r
arXiv also offers bulk access to the papers: you can download them via S3
'requester pays'.

[https://arxiv.org/help/bulk_data](https://arxiv.org/help/bulk_data)
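
For anyone who hasn't used requester-pays buckets, here is a rough sketch of what a download can look like with boto3. The bucket name `arxiv` and the manifest key are assumptions on my part, so verify them on the bulk data page first; `requester_pays_args` and `download_manifest` are just illustrative helper names. You need AWS credentials configured, and the transfer is billed to your account.

```python
"""Sketch: fetch one object from a requester-pays S3 bucket with boto3."""

# Assumed values; verify against https://arxiv.org/help/bulk_data before use.
BUCKET = "arxiv"
MANIFEST_KEY = "src/arXiv_src_manifest.xml"


def requester_pays_args():
    """Extra arguments that tell S3 the caller accepts the transfer charges."""
    return {"RequestPayer": "requester"}


def download_manifest(dest="arXiv_src_manifest.xml"):
    """Download the (assumed) source manifest; needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the helper above works without boto3

    s3 = boto3.client("s3")
    s3.download_file(BUCKET, MANIFEST_KEY, dest, ExtraArgs=requester_pays_args())
    return dest
```

Starting with a manifest rather than the papers themselves is a cheap way to see how much data you would be paying to transfer.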

------
vansul
Very cool, I really look forward to seeing science move forward along these
paths. All the same, I have to post the obligatory (and recent) xkcd rebuttal:
[https://xkcd.com/2341/](https://xkcd.com/2341/)

~~~
canjobear
I'm a scientist and my problems are a lot more like the first panel than the
second.

~~~
anchpop
I think this is the second time Randall has made that joke:
[https://xkcd.com/1831/](https://xkcd.com/1831/). I wonder if he has a grudge
against data scientists or something.

~~~
thecupisblue
I think it's aimed at people who assume that all other fields are just
subfields of their own field.

