Wikipedia2Vec: Optimized Implementation for Learning Embeddings from Wikipedia

ikuyamada · on Dec 23, 2018

You can see the visualization of the embedding vector space here: http://projector.tensorflow.org/?config=https://wikipedia2ve...

I recommend using t-SNE instead of PCA; you can select it with the button at the bottom left.

jessriedel · on Dec 23, 2018

Is this the notion of "embedding" they are discussing?

https://en.m.wikipedia.org/wiki/Word_embedding

What is an "entity"?

ikuyamada · on Dec 23, 2018

Author here. More broadly, an embedding is a mapping from objects (e.g., words and entities) to vectors of real numbers. And as described in mcxlog's comment, an entity refers to an entry in Wikipedia in this paper.
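To make "a mapping from objects to vectors" concrete, here is a minimal sketch. The vectors are invented toy values, not output of the tool, and the "ENTITY/" key prefix is a hypothetical convention used only to tell words and entities apart in one table.

```python
import math

# Toy 3-dimensional vectors, invented for illustration only.
# Words and entities live in the same vector space; the "ENTITY/" prefix
# is a hypothetical naming convention, not something the tool prescribes.
embedding = {
    "tokyo": [0.9, 0.1, 0.0],
    "sushi": [0.6, 0.7, 0.1],
    "ENTITY/Tokyo": [0.8, 0.2, 0.1],
    "ENTITY/Albert Einstein": [0.0, 0.1, 0.9],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(key, k=2):
    """Return the k keys (words or entities) closest to `key`."""
    query = embedding[key]
    scored = [
        (other, cosine(query, vec))
        for other, vec in embedding.items()
        if other != key
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


if __name__ == "__main__":
    for name, score in most_similar("tokyo"):
        print(name, round(score, 3))
```

Because words and entities share one space, a word can be compared directly with an entity, which is what makes the nearest-neighbor lookup above meaningful.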

mcxlog · on Dec 23, 2018

In this field, an entity typically refers to an entry in Wikipedia, such as a person, an organization, a place, etc. (the types considered may vary).

And yes, that page is the definition of embedding they are referring to.

rasz · on Dec 23, 2018

I'm a little confused. Isn't this just Word2vec pretrained on the content of Wikipedia?

ikuyamada · on Dec 23, 2018

Unlike Word2vec, this tool learns embeddings of entities (i.e., entries in Wikipedia) as well as words. And although the model implemented in this tool is based on Word2vec's skip-gram model, it is extended with two submodels (the Wikipedia link graph model and the anchor context model). Please refer to the documentation for details: https://wikipedia2vec.github.io/wikipedia2vec/intro/
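As a rough illustration of how the three parts differ, the sketch below generates the (target, context) training pairs each one would see. It is a simplification based on my reading of the linked documentation, not the tool's actual code: the function names, the single-token anchors, and the "ENTITY/" prefix are all assumptions made for the example.

```python
def word_pairs(tokens, window=2):
    """Skip-gram pairs: each word predicts the words around it."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        pairs.extend((target, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs


def link_graph_pairs(links):
    """Link graph pairs: each entity predicts its neighbors in the link graph.

    `links` maps an entity to the entities it is linked with.
    """
    return [(src, dst) for src, targets in links.items() for dst in targets]


def anchor_context_pairs(tokens, anchors, window=2):
    """Anchor context pairs: a linked entity predicts words near its anchor.

    `anchors` maps a token index (where the link text sits) to an entity.
    Anchors are assumed to be a single token to keep the sketch short.
    """
    pairs = []
    for i, entity in anchors.items():
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        pairs.extend((entity, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs


if __name__ == "__main__":
    tokens = ["einstein", "was", "born", "in", "ulm"]
    anchors = {4: "ENTITY/Ulm"}  # the token "ulm" links to the Ulm page
    links = {"ENTITY/Albert Einstein": ["ENTITY/Ulm"]}

    print(word_pairs(tokens, window=1))
    print(link_graph_pairs(links))
    print(anchor_context_pairs(tokens, anchors))
```

Training on all three kinds of pairs with shared vectors is what would place words and entities in the same space; plain Word2vec only ever sees the first kind.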

crucialfelix · on Dec 23, 2018

Perfect timing for me. I've been working on a project that needs multilingual vectorizing and entity recognition. I've been using DBpedia and YAGO queries.

Given an input text, what would be a good way to extract a list of entities? The word sequence should make it possible to tell whether something is an entity or just a word.

Is it possible to do fine-tuning à la BERT or ELMo?

ikuyamada · on Dec 24, 2018

My past paper describes an entity linking method based on Wikipedia2Vec: https://arxiv.org/abs/1601.01343

You need to extract entity names using NER software (e.g., spaCy, Stanford NER), and then resolve the names to knowledge base entities using the entity linking method.
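A rough sketch of the second step (resolving a name to an entity), assuming the mention has already been extracted by an NER tool. This is not the method from the paper: it swaps in a much simpler heuristic that picks the candidate entity whose vector is closest to the average of the context word vectors. The vectors, the `candidates` dictionary, and the `link` function are all hypothetical.

```python
import math

# Toy vectors invented for illustration; a real system would load
# pretrained word and entity embeddings instead.
word_vectors = {
    "fruit": [0.9, 0.1],
    "pie": [0.8, 0.2],
    "iphone": [0.1, 0.9],
}
entity_vectors = {
    "Apple": [0.9, 0.2],        # the fruit
    "Apple Inc.": [0.1, 0.95],  # the company
}
# Hypothetical mention dictionary: lowercased surface form -> candidates.
candidates = {"apple": ["Apple", "Apple Inc."]}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_vector(vectors):
    """Component-wise average of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def link(mention, context_words):
    """Pick the candidate entity closest to the averaged context words.

    Returns None when the mention has no candidates or no context word
    has a known vector.
    """
    known = [word_vectors[w] for w in context_words if w in word_vectors]
    options = candidates.get(mention.lower(), [])
    if not known or not options:
        return None
    context = mean_vector(known)
    return max(options, key=lambda e: cosine(context, entity_vectors[e]))


if __name__ == "__main__":
    print(link("Apple", ["fruit", "pie"]))
    print(link("Apple", ["iphone"]))
```

The same surface form resolves to different entities depending on the surrounding words, which is the core idea of embedding-based disambiguation, even though a production system would use a learned ranker and richer features.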

currymj · on Dec 23, 2018

You can use the TAGME API to do that; it's state of the art as far as I know. If you want to implement it yourself, I think it's pretty painful, though.

There's no reason you can't fine-tune these on your task; that's true of any word embeddings, not just BERT or ELMo.