
Show HN: Wikipedia2Vec – A tool for learning embeddings of words and entities - ikuyamada
https://wikipedia2vec.github.io/
======
lightbyte
If you're wondering what the point of this is when pre-trained word2vec
Wikipedia embeddings are already easy to find, the site explains:

>Wikipedia2Vec is based on Word2vec's skip-gram model, which learns to
predict neighboring words given each word in a corpus. We extend the skip-gram
model by adding the following two submodels:

>The link graph model that learns to estimate neighboring entities given an
entity in the link graph of Wikipedia entities.

>The anchor context model that learns to predict neighboring words given an
entity by using a link that points to the entity and its neighboring words.

>By jointly optimizing the skip-gram model and these two submodels, our model
simultaneously learns the embedding of words and entities from Wikipedia. For
further details, please refer to our paper: Joint Learning of the Embedding of
Words and Entities for Named Entity Disambiguation.
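
In practice this means one model file exposes vectors for both words and
entities. A minimal usage sketch based on the project's documented Python API
(the model path is a placeholder):

    from wikipedia2vec import Wikipedia2Vec

    # load a pretrained model (placeholder file name)
    wiki2vec = Wikipedia2Vec.load('MODEL_FILE')

    # word and entity vectors live in the same space
    w = wiki2vec.get_word_vector('machine')
    e = wiki2vec.get_entity_vector('Machine learning')

    # nearest neighbors of an entity can be words or other entities
    wiki2vec.most_similar(wiki2vec.get_entity('Machine learning'), 5)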

------
alexcnwy
Very cool!

I'm currently busy with my MSc thesis on learning to link plain text documents
to semantically relevant Wikipedia articles (and some of the cool machine
learning-y things you can do from there).

I have 2 questions about your work:

1. I'm not sure if you're familiar with Doc2Vec, but it allows you to train a
Word2Vec model while also learning a vector for each document in the training
corpus. Wikipedia is commonly used as a training corpus, so you can get a
"DocVec" for each Wikipedia article in the same vector space as your Word2Vec
model (i.e. the DocVec for the Wikipedia page "Machine Learning" is near the
WordVec for "mathematics"). Did you consider/compare with using Doc2Vec to
learn a vector for each Wikipedia page and then using those as your entity
vectors?
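
For reference, a rough gensim sketch of what I mean (the corpus iterable here
is a placeholder):

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    # one TaggedDocument per Wikipedia article, tagged with its title
    docs = [TaggedDocument(words=tokens, tags=[title])
            for title, tokens in wikipedia_articles()]  # placeholder corpus

    model = Doc2Vec(docs, vector_size=100, window=5, min_count=5, workers=4)

    # document vectors for pages and word vectors are trained jointly
    doc_vec = model.dv['Machine learning']
    word_vec = model.wv['mathematics']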

2. Your "Features" page says you convert an entity name to a link pointing to
an entity if the entity name is unambiguous. In the case of ambiguous entity
names in a link (which happens often; my research is solely on learning links
to Wikipedia articles from plain text documents), did you consider using the
entity vectors (or some simple model built on top of the word vectors of the
target page) to disambiguate?

~~~
ikuyamada
Thanks :) 1) I think learning entity embeddings using the Doc2Vec (paragraph
vector) model is an interesting idea, but we did not test it. 2) This tool was
initially developed to address the entity linking task. Mapping words and
entities into the same vector space makes it possible to model the contextual
information that is useful for entity linking. For details, please refer to
this paper: Joint Learning of the Embedding of Words and Entities for Named
Entity Disambiguation:
[https://arxiv.org/pdf/1601.01343.pdf](https://arxiv.org/pdf/1601.01343.pdf)
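
As a toy illustration of why the shared space helps (a simplification, not
the actual model from the paper): given a mention, you can score each
candidate entity by its similarity to the surrounding words.

    import numpy as np

    def disambiguate(context_words, candidate_titles, wiki2vec):
        # toy scorer: pick the candidate entity whose vector is most
        # similar to the average of the context word vectors
        vecs = [wiki2vec.get_word_vector(w) for w in context_words
                if wiki2vec.get_word(w) is not None]
        ctx = np.mean(vecs, axis=0)

        def score(title):
            v = wiki2vec.get_entity_vector(title)
            return ctx.dot(v) / (np.linalg.norm(ctx) * np.linalg.norm(v))

        return max(candidate_titles, key=score)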

------
gitgud
Off Topic:

If you click this post link, then click "Features" on the website, the browser
'back' button breaks. Clicking back multiple times doesn't get you back to
Hacker News.

Can anyone explain what's going on here? Seems pretty annoying.

------
rpedela
This is awesome! Is there a plan to support all the major embedding
algorithms?

~~~
ikuyamada
Regarding word embedding algorithms, I am interested in supporting other
models that use subword information (e.g., fastText). Further, various recent
models have been proposed for learning entity representations from knowledge
bases, and I plan to work on them.

------
m1sta_
Is there any obvious reason why all entities that have a Wikipedia article
associated with them don't appear as entities in this output?

~~~
ikuyamada
What kind of output do you mean? Wikipedia2Vec learns embeddings only for
entities that are linked from other articles more than _min-entity-count_
times.
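
That threshold is set at training time; an illustrative invocation (see the
docs for the exact options and defaults):

    # entities linked fewer than --min-entity-count times are
    # excluded from the vocabulary (illustrative; check the docs)
    wikipedia2vec train enwiki-latest-pages-articles.xml.bz2 MODEL_FILE --min-entity-count=5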

------
wyldfire
How portable are tools like these among different ontologies/knowledge bases?

~~~
ikuyamada
The current code is written specifically for Wikipedia. However, the
algorithm is portable to other knowledge bases that contain articles and
entity annotations.

------
riku_iki
Curious if there are any benchmarks against fastText, ELMo, etc.

~~~
ikuyamada
We did not add fastText to our benchmarks because of a minor technical issue,
but we will work on it. Further, to conduct a fair comparison with ELMo, I
think it is necessary to use _extrinsic_ tasks such as question answering and
textual entailment.

~~~
riku_iki
You could build a very simple ~100-line benchmark, e.g. classification on the
IMDB dataset, run it with both ELMo and your embeddings, and see which
performs better.
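
Roughly (a toy sketch; load_imdb and the model path are placeholders):

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from wikipedia2vec import Wikipedia2Vec

    wiki2vec = Wikipedia2Vec.load('MODEL_FILE')  # placeholder path
    DIM = 100  # assumes the default embedding size

    def embed(text):
        # average of in-vocabulary word vectors; zeros if none matched
        vecs = [wiki2vec.get_word_vector(t) for t in text.lower().split()
                if wiki2vec.get_word(t) is not None]
        return np.mean(vecs, axis=0) if vecs else np.zeros(DIM)

    texts, labels = load_imdb()  # placeholder: any (texts, 0/1 labels) pair
    X = np.stack([embed(t) for t in texts])
    X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.2)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    print('accuracy:', clf.score(X_te, y_te))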

~~~
ikuyamada
Thank you for your feedback! I am also interested in conducting experiments
on extrinsic tasks such as text classification. In addition to word
embeddings, Wikipedia2Vec also contains entity embeddings, which are likely
beneficial for these tasks, so I would like to design a model that uses both
the word embeddings and the entity embeddings.

------
stared
Its name is misleading. It's word2vec trained on Wikipedia text, not vectors
for Wikipedia pages (cf. node2vec, food2vec, game2vec, etc.).

~~~
ikuyamada
Please note that, similar to other approaches (e.g., node2vec), Wikipedia2Vec
learns embeddings for Wikipedia entities in addition to embeddings for words.

