
Unsupervised word embeddings capture latent knowledge from scientific literature - KasianFranks
https://www.nature.com/articles/s41586-019-1335-8
======
gnomewascool
For the lazy, the DOI is 10.1038/s41586-019-1335-8 if you want to add it to
your bibliographies. Obviously, don't use the DOI for any illegal purposes,
such as getting around the paywall.

~~~
ovi256
For the even lazier, one can just add [https://sci-hub.tw/](https://sci-hub.tw/)
in front of the Nature URL and, through the great work of the Sci-Hub
people, get access to a web page that has the paper. It's an amazing UX!

~~~
argd678
And yet more lazy: [https://techxplore.com/news/2019-07-machine-learning-algorit...](https://techxplore.com/news/2019-07-machine-learning-algorithms-uncover-hidden-scientific.html)

------
solarist
An article about the paper by the first author:
[https://towardsdatascience.com/using-unsupervised-machine-le...](https://towardsdatascience.com/using-unsupervised-machine-learning-to-uncover-hidden-scientific-knowledge-6a3689e1c78d)

------
iandanforth
Summary: Given abstracts of materials science papers, they were able to predict
that certain materials would have desirable/interesting properties before
these materials were actually examined for those properties. This was
confirmed by "holding out" recent years of data and then seeing if predictions
from, say, 2009 would have held up today. They have also made predictions which
have yet to be confirmed / refuted.
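
A minimal sketch of that hold-out check, assuming the model has already produced a ranked list of candidate materials from text published up to a cutoff year. The function name `hits_at_k`, the material names, the ranked list and the set of later-reported materials are all made-up placeholders, not data or code from the paper.

```python
def hits_at_k(ranked_candidates, later_reported, k):
    """Fraction of the top-k predictions that were reported after the cutoff."""
    top_k = ranked_candidates[:k]
    if not top_k:
        return 0.0
    hits = sum(1 for material in top_k if material in later_reported)
    return hits / len(top_k)


# Hypothetical ranking produced from abstracts published up to the cutoff year.
ranked_2009 = ["mat_A", "mat_B", "mat_C", "mat_D", "mat_E"]

# Hypothetical materials first reported for the property after the cutoff.
reported_after_2009 = {"mat_B", "mat_C", "mat_X"}

print(hits_at_k(ranked_2009, reported_after_2009, k=3))
```

Comparing this fraction against the same number for randomly chosen candidates is one way to tell whether the ranking carries any signal.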

Interesting points on future work:

\- This was only using abstracts. Using full papers could yield significant
improvements.

\- Uses word2vec and not BERT / ELMo, so there's likely to be another jump in
performance there.

------
moconnor
You can get an idea of the content from their GitHub
[https://github.com/materialsintelligence/mat2vec/blob/master...](https://github.com/materialsintelligence/mat2vec/blob/master/README.md)

The author emails are at the end of README.md if you still want to ask for a
preprint.

------
PeterStuer
If you do not have access to the Nature paper, this paper reports on the same
study.
[https://chemrxiv.org/articles/Named_Entity_Recognition_and_N...](https://chemrxiv.org/articles/Named_Entity_Recognition_and_Normalization_Applied_to_Large-Scale_Information_Extraction_from_the_Materials_Science_Literature/8226068/1)

------
haddr
A five-year-old discovery, nothing spectacular (as of 2019). On the other hand, a
good example of publishing: code, corpora and materials are available for
everyone to reproduce it.

------
msamwald
This is certainly a nice paper, but it is also a bit puzzling that this was
noteworthy enough to be published in Nature.

~~~
roenxi
> an unsupervised method can recommend materials for functional applications
> several years before their discovery

It is probably hype, but if that sentence is taken literally it would be
_huge_.

The limits on human innovation have historically been chemical/materials
science related rather than a lack of imagination. Anything that allows search
to be deployed on things that don't even exist would be ... well, big.

~~~
tastroder
> It is probably hype, but if that sentence is taken literally it would be
> huge.

It's not really hype, but it's also not that novel, and results in this
type of domain still have to be verified through other means. $foo2vec papers
have been doing this for several domains, framed as text retrieval and link
prediction / knowledge base completion, for a few years now.

------
delton137
We published something similar in spirit recently (although it ended up as a
conference paper and not in Nature)... Notably, we did our study with far
less data: instead of millions of patents we had the text of a few thousand
patents and the text of a few hundred conference papers. We had a specific
focus: texts about energetic materials (explosives and propellants).

We showed how chemical-application & chemical-property relations are captured
by word2vec and GloVe. For instance, we found that rocket fuels were the
chemicals appearing closest to “rocket”, while materials used in air bags
appeared closest to “air bag”. We were able to filter to chemical names using
ChemDataExtractor and further to likely energetic chemicals by obtaining
SMILES strings from PubChem and using a classifier to classify them as likely
energetics or not.
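
To make "appearing closest" concrete, here is a rough sketch of a nearest-neighbour lookup by cosine similarity. The three-dimensional vectors and the names `chem_1` to `chem_3` are invented for illustration; real word2vec or GloVe vectors have hundreds of dimensions and come from training on a corpus, and this is not the code from the linked paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest(query, vectors, candidates, top_n=2):
    """Return the top_n candidates whose vectors are most similar to the query's."""
    scored = [
        (cosine(vectors[query], vectors[c]), c)
        for c in candidates
        if c != query
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:top_n]]


# Hypothetical toy embeddings; the values are made up, not learned.
toy_vectors = {
    "rocket": [0.9, 0.1, 0.0],
    "air_bag": [0.1, 0.9, 0.0],
    "chem_1": [0.8, 0.2, 0.1],
    "chem_2": [0.2, 0.8, 0.1],
    "chem_3": [0.0, 0.1, 0.9],
}

# Restrict the search to chemical names, as a name filter would.
chemicals = ["chem_1", "chem_2", "chem_3"]

print(nearest("rocket", toy_vectors, chemicals, top_n=1))
print(nearest("air_bag", toy_vectors, chemicals, top_n=1))
```

Passing only the chemical names as candidates mirrors the filtering step: the application word is the query, and everything that is not a chemical is left out of the ranking.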

You can find our work here :
[https://arxiv.org/pdf/1903.00415.pdf](https://arxiv.org/pdf/1903.00415.pdf) .

------
tastroder
Is the novel part the application to materials science? I can't get to the
Nature paper on mobile but the analysis in the other resources linked here
looks pretty thorough.

Is there anything new methodology-wise in the Nature version?

------
smaddox
Do the authors have a draft pdf available?

------
tshitoyan
Hi All, glad to see our paper caught your attention. Here is a link to read
the paper: [https://rdcu.be/bItqk](https://rdcu.be/bItqk)

------
Der_Einzige
Between this and the UMAP paper on cancer publishing in Nature, I'm convinced
that my next publication will be in the same place that Isaac Newton
published in.

