Wikipedia2Vec: Optimized Implementation for Learning Embeddings from Wikipedia (arxiv.org)
109 points by khabb 24 days ago | 9 comments

You can see the visualization of the embedding vector space here: http://projector.tensorflow.org/?config=https://wikipedia2ve...

I recommend using t-SNE instead of PCA; it can be selected with the button at the bottom left.

Is this the notion of "embedding" they are discussing?


What is an "entity"?

Author here. More broadly, an embedding is a mapping from objects (e.g., words and entities) to vectors of real numbers. And as described in mcxlog's comment, an entity refers to an entry in Wikipedia in this paper.
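To make the "mapping from objects to vectors" idea concrete, here is a minimal sketch in plain Python. The vectors are made up for illustration, and the "ENTITY/" key prefix is just a convention chosen for this example, not something taken from the tool.

```python
import math

# Toy embedding table with made-up 3-dimensional vectors (illustrative only).
# Keys starting with "ENTITY/" stand in for Wikipedia entries; the rest are words.
embedding = {
    "guitar": [0.9, 0.1, 0.0],
    "piano": [0.8, 0.2, 0.1],
    "ENTITY/Jimi Hendrix": [0.85, 0.15, 0.05],
    "ENTITY/Tokyo": [0.0, 0.2, 0.9],
}

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(key, table, k=2):
    """Return the k keys (words or entities) closest to `key` by cosine similarity."""
    query = table[key]
    scored = [(other, cosine(query, vec)) for other, vec in table.items() if other != key]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

print(most_similar("guitar", embedding))
```

Because words and entities live in the same table, a word can be compared directly with an entity, which is the point of learning both in one space.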

In this field, an entity typically refers to an entry in Wikipedia, such as a person, an organization, a place, etc. (the types considered may vary).

And yes, that page is the embedding definition they are referring to.

I'm a little confused. Isn't this just Word2vec pretrained on the content of Wikipedia?

Unlike Word2vec, this tool learns embeddings of entities (i.e., entries in Wikipedia) as well as words. And although the model implemented in this tool is based on Word2vec's skip-gram model, it is extended with two submodels (the Wikipedia link graph model and the anchor context model). Please refer to the documentation for details: https://wikipedia2vec.github.io/wikipedia2vec/intro/
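As a rough illustration of how the three parts could fit together, the sketch below generates (target, context) training pairs from three sources: a skip-gram window over words, links between entities, and the words around an anchor. This is my reading of the model names, not the tool's actual code; the function names, the data layout, and the "ENTITY/" prefix are all hypothetical, so check the linked documentation for the real definitions.

```python
def word_context_pairs(tokens, window=2):
    """Skip-gram pairs: each word predicts the words within `window` positions."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        pairs.extend((target, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs

def link_graph_pairs(links):
    """Link graph pairs (assumed form): each entity predicts the entities it links to."""
    return [(src, dst) for src, targets in links.items() for dst in targets]

def anchor_context_pairs(tokens, anchors, window=2):
    """Anchor context pairs (assumed form): a linked entity predicts nearby words.

    `anchors` maps (start, end) token spans, end exclusive, to an entity title.
    """
    pairs = []
    for (start, end), entity in anchors.items():
        lo, hi = max(0, start - window), min(len(tokens), end + window)
        for j in range(lo, hi):
            if not (start <= j < end):  # skip the anchor text itself
                pairs.append((entity, tokens[j]))
    return pairs

# Hypothetical toy sentence with two anchors and a tiny link graph.
tokens = "hendrix played the guitar in london".split()
anchors = {(0, 1): "ENTITY/Jimi Hendrix", (5, 6): "ENTITY/London"}
links = {"ENTITY/Jimi Hendrix": ["ENTITY/London", "ENTITY/Electric guitar"]}

training_pairs = (
    word_context_pairs(tokens)
    + link_graph_pairs(links)
    + anchor_context_pairs(tokens, anchors)
)
```

Feeding all three kinds of pairs to the same skip-gram style objective is what would place words and entities in one vector space, as opposed to plain Word2vec, which only sees the first kind.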

Perfect timing for me. I've been working on a project that needs multilingual vectorizing and entity recognition. I've been using DBpedia and YAGO queries.

Given an input text, what would be a good way to extract a list of entities? The word sequence should make it possible to tell whether something is an entity or just a word.

Is it possible to do fine-tuning à la BERT or ELMo?

My past paper describes an entity linking method based on Wikipedia2Vec: https://arxiv.org/abs/1601.01343

You need to extract entity names using NER software (e.g., spaCy, Stanford NER), and then resolve the names to knowledge base entities using the entity linking method.
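The two-step pipeline (find mentions, then resolve them) can be sketched as follows. This is a toy stand-in, not the method from the linked paper: a dictionary lookup replaces real NER software, the candidate lists and 2-dimensional vectors are made up, and disambiguation simply picks the candidate whose vector is closest to the average of the surrounding word vectors.

```python
import math

# Hypothetical toy data: surface form -> candidate entities, plus made-up vectors.
CANDIDATES = {
    "paris": ["ENTITY/Paris", "ENTITY/Paris Hilton"],
}
VECTORS = {
    "ENTITY/Paris": [0.9, 0.1],
    "ENTITY/Paris Hilton": [0.1, 0.9],
    "capital": [0.8, 0.2],
    "france": [0.9, 0.0],
    "celebrity": [0.1, 0.8],
    "hotel": [0.3, 0.7],
}

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def context_vector(tokens, skip_index):
    """Average the vectors of all known context words, excluding the mention."""
    vecs = [VECTORS[t] for i, t in enumerate(tokens) if i != skip_index and t in VECTORS]
    if not vecs:
        return None
    return [sum(v[d] for v in vecs) / len(vecs) for d in range(len(vecs[0]))]

def link_entities(text):
    """Return (mention, entity) pairs for single-token mentions found in `text`."""
    tokens = text.lower().split()
    results = []
    for i, token in enumerate(tokens):
        candidates = CANDIDATES.get(token)  # step 1: mention detection (stand-in for NER)
        if not candidates:
            continue
        ctx = context_vector(tokens, i)
        if ctx is None:
            best = candidates[0]  # no usable context: fall back to the first candidate
        else:
            # step 2: disambiguation by similarity to the context
            best = max(candidates, key=lambda c: cosine(VECTORS[c], ctx))
        results.append((token, best))
    return results

print(link_entities("Paris is the capital of France"))
print(link_entities("Paris the celebrity hotel heiress"))
```

A real system would handle multi-word mentions and punctuation, and would typically combine similarity with other signals such as how often a name refers to each candidate.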

You can use the TAGME API to do that; it’s state of the art as far as I know. If you want to implement it yourself, I think it’s pretty painful, though.

There's no reason you can’t fine-tune these on your task; that’s true of any word embeddings, not just BERT or ELMo.
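For what "fine-tuning embeddings on your task" means in practice, here is a minimal sketch in plain Python: a logistic classifier over averaged word vectors, where the gradient also flows back into the vectors themselves. The starting vectors, the task, and the hyperparameters are all made up for illustration; a real setup would load pretrained vectors and use a proper training framework.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def average(embeddings, tokens, dim):
    """Average the vectors of the known tokens; returns (known_tokens, avg_vector)."""
    known = [t for t in tokens if t in embeddings]
    if not known:
        return known, [0.0] * dim
    return known, [sum(embeddings[t][d] for t in known) / len(known) for d in range(dim)]

def fine_tune(embeddings, examples, epochs=50, lr=0.5):
    """Train a logistic classifier and update `embeddings` in place with SGD.

    embeddings: dict token -> list of floats (the pretrained starting point)
    examples: list of (tokens, label) pairs with label 0 or 1
    """
    dim = len(next(iter(embeddings.values())))
    weights, bias = [0.0] * dim, 0.0
    for _ in range(epochs):
        for tokens, label in examples:
            known, avg = average(embeddings, tokens, dim)
            if not known:
                continue
            pred = sigmoid(sum(w * a for w, a in zip(weights, avg)) + bias)
            err = pred - label  # gradient of the log loss with respect to the logit
            # Gradient for each token vector, computed with the pre-update weights.
            grad_emb = [err * w / len(known) for w in weights]
            weights = [w - lr * err * a for w, a in zip(weights, avg)]
            bias -= lr * err
            for t in known:  # this step is what makes it fine-tuning
                embeddings[t] = [e - lr * g for e, g in zip(embeddings[t], grad_emb)]
    return weights, bias

def predict(embeddings, weights, bias, tokens):
    """Probability of label 1 for a token list."""
    _, avg = average(embeddings, tokens, len(weights))
    return sigmoid(sum(w * a for w, a in zip(weights, avg)) + bias)

# Hypothetical "pretrained" vectors and a tiny sentiment-style task.
vectors = {"good": [0.5, 0.1], "great": [0.6, 0.2], "bad": [-0.4, 0.1], "awful": [-0.5, 0.0]}
data = [(["good"], 1), (["great"], 1), (["bad"], 0), (["awful"], 0)]
w, b = fine_tune(vectors, data)
```

Freezing the vectors (skipping the final update loop) would give a plain classifier on fixed features; updating them is the only difference.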
