>> What is the dimensionality of each word vector and what does a word's position in this space "mean"? What is this dimensionality determined by?
Each dimension is, roughly, a new way that words can be similar or dissimilar. I've got 1000-dimensional vectors, so words can be similar or dissimilar in only a thousand 'ways'; associations like 'luxury', 'thoughtful', 'person', 'place', or 'object' get learned (roughly speaking). Of course, real words are far more diverse, so this is an approximation. The dimensionality is configurable, and in theory more dimensions means more contrast is captured, but you need more training data. In practice I chose 1000 because that maxes out my large-memory machine. That said, the word2vec paper shows good results at 1000 dimensions, so it doesn't seem to be a bad choice.
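To make "similar in this space" concrete, here's a toy sketch of the kind of lookup involved. The vocabulary and vectors below are random stand-ins, not the real word2vec output; with trained vectors the nearest neighbours of 'paris' would actually be meaningful.

    import numpy as np

    # Toy stand-ins for the real data: a tiny vocabulary and random
    # 1000-dimensional vectors (the real ones come out of word2vec).
    vocab = ['paris', 'london', 'france', 'england', 'luxury', 'snow']
    rng = np.random.default_rng(0)
    vectors = rng.standard_normal((len(vocab), 1000)).astype(np.float32)

    # Normalise rows so that a dot product is a cosine similarity:
    # two words are "similar" when their vectors point the same way.
    vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)

    def most_similar(word, k=3):
        # Brute-force nearest neighbours by cosine similarity.
        sims = vectors @ vectors[vocab.index(word)]
        order = np.argsort(-sims)
        return [(vocab[i], float(sims[i])) for i in order if vocab[i] != word][:k]

    print(most_similar('paris'))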
>> Have you tried any dimensionality reduction algorithms like PCA or Isomap?
Yes! I've tried out PCA, and some spectral biclustering using the off-the-shelf algorithms in scikit-learn. I only played around with this for an hour or so and got discouraging results. Nevertheless, the word2vec papers show that this works really well for projecting France, USA, Paris, DC, London, etc. onto a two-dimensional plane where the axes roughly correspond to countries & capitals -- exactly what you'd hope for! I wasn't able to replicate that, but Tomas Mikolov was!
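For reference, the 2-D projection itself is only a few lines with scikit-learn. The sketch below uses made-up random vectors in place of the trained ones, which is exactly why it won't reproduce the nice country/capital axes from the paper:

    import numpy as np
    from sklearn.decomposition import PCA

    # Hypothetical stand-in: map each word of interest to a random
    # 1000-D vector (the real ones would come from word2vec).
    rng = np.random.default_rng(1)
    words = ['france', 'paris', 'usa', 'washington', 'england', 'london']
    vectors = {w: rng.standard_normal(1000) for w in words}

    # Project just these words onto the top two principal components;
    # in the word2vec papers the two axes come out roughly aligned
    # with "country-ness" and "capital-ness".
    X = np.array([vectors[w] for w in words])
    coords = PCA(n_components=2).fit_transform(X)

    for word, (x, y) in zip(words, coords):
        print(f'{word:12s} {x:+.3f} {y:+.3f}')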
>> It would be interesting to find the word vectors that contain the most variation across all of wikipedia.
Hmm, interesting indeed! I'm not sure how I'd go about measuring 'variation' -- would this amount to isolating word clusters and finding the densest ones? Something like finding a cluster with a hundred variations of the word 'snow' (if you're Inuit)? I'd be willing to part with the raw vector database if there's interest.
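If 'variation' does mean dense clusters, one rough way to poke at it would be something like the sketch below: cluster the vocabulary and rank clusters by how tightly packed they are. The vectors here are random stand-ins and 50 clusters is an arbitrary choice.

    import numpy as np
    from sklearn.cluster import KMeans

    # Random stand-in for the real word vectors: one row per word.
    rng = np.random.default_rng(2)
    vectors = rng.standard_normal((2000, 1000)).astype(np.float32)

    # Cluster the vocabulary, then rank clusters by how tightly packed
    # they are (mean distance of members to their centroid). A very
    # dense cluster would be the "hundred words for snow" case.
    km = KMeans(n_clusters=50, n_init=10, random_state=0).fit(vectors)
    for c in range(km.n_clusters):
        members = vectors[km.labels_ == c]
        spread = np.linalg.norm(members - km.cluster_centers_[c], axis=1).mean()
        print(c, len(members), round(float(spread), 3))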
>> Have you tried any other nearest neighbor search methods other than a simple dot product, such as locality sensitive hashing?
Only a little bit, although I'm very interested in finding a faster approach than computing the whole damn dot product against every word (see: https://news.ycombinator.com/item?id=6720359). I worry that traditional locality-sensitive hashes, kd-trees, and the like work well for 3D locations but miserably for 1000-dimensional data like I have here.
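One LSH flavour that does seem suited to this kind of data is random-hyperplane hashing for cosine similarity, since it only cares about directions rather than the high-dimensional distances that hurt kd-trees. A minimal sketch, with random stand-in vectors, an arbitrary 16-bit key, and none of the multi-table tricks a real setup would need:

    import numpy as np
    from collections import defaultdict

    rng = np.random.default_rng(3)
    n_words, dim, n_bits = 50000, 1000, 16

    # Random stand-in for the normalised word2vec vectors.
    vectors = rng.standard_normal((n_words, dim)).astype(np.float32)
    vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)

    # Random-hyperplane LSH: each hyperplane contributes one sign bit,
    # so vectors pointing the same way tend to share a bucket key.
    planes = rng.standard_normal((n_bits, dim)).astype(np.float32)

    def bucket_key(v):
        return ((planes @ v) > 0).tobytes()

    buckets = defaultdict(list)
    for i, v in enumerate(vectors):
        buckets[bucket_key(v)].append(i)

    def approx_most_similar(query, k=10):
        # Only score the words that landed in the query's bucket,
        # instead of dotting the query with the whole vocabulary.
        idx = buckets[bucket_key(query)]
        sims = vectors[idx] @ query
        order = np.argsort(-sims)[:k]
        return [(idx[j], float(sims[j])) for j in order]

    print(approx_most_similar(vectors[0]))

In practice you'd use several independent hash tables (or shorter keys) so that true neighbours aren't lost when they land one bit away from the query's bucket.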
I should reiterate that most of the hard work revolves around the word2vec algorithm, which I used but didn't write. It's awesome; check it and the papers out here: https://code.google.com/p/word2vec/
Build your kd-tree on the vectors expressed in the eigenvector basis. If the eigenvalues decrease fast enough, you can get bounds on the dot product while going only a few levels deep.
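In case it helps, here's a rough, simplified sketch of that idea: rotate into the PCA/eigenvector basis, build a kd-tree on the top components where most of the variance lives, shortlist candidates there, and only then do exact dot products in the full space. It's a filter-then-verify approximation rather than the exact bounding scheme described above, all sizes are arbitrary, and the vectors are random stand-ins.

    import numpy as np
    from scipy.spatial import cKDTree
    from sklearn.decomposition import PCA

    # Random stand-in for the normalised word vectors.
    rng = np.random.default_rng(4)
    vectors = rng.standard_normal((20000, 1000)).astype(np.float32)
    vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)

    # Rotate into the eigenvector (PCA) basis and keep the top
    # components, then build the kd-tree in that reduced space.
    pca = PCA(n_components=32).fit(vectors)
    reduced = pca.transform(vectors)
    tree = cKDTree(reduced)

    def approx_neighbours(i, k=10, shortlist=200):
        # Shortlist by distance in the reduced space, then re-rank the
        # survivors with exact dot products in the full 1000-D space.
        _, idx = tree.query(reduced[i], k=shortlist)
        sims = vectors[idx] @ vectors[i]
        order = np.argsort(-sims)[:k]
        return [(int(idx[j]), float(sims[j])) for j in order]

    print(approx_neighbours(0))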
Don't most search engines use an inverted index to find the similarity between the query vector and the document vectors? (instead of doing the dot product with every document)
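Inverted indexes pay off when the vectors are sparse (most term weights are zero), which is the search-engine case; dense 1000-D word2vec vectors don't have zeros to skip. A toy sketch of the sparse setup, with made-up documents:

    from collections import defaultdict

    # Toy sparse "documents": term -> weight maps.
    docs = {
        0: {'paris': 0.8, 'france': 0.6},
        1: {'london': 0.7, 'england': 0.7},
        2: {'paris': 0.5, 'luxury': 0.9},
    }

    # Inverted index: term -> postings list of (doc id, weight).
    index = defaultdict(list)
    for doc_id, terms in docs.items():
        for term, w in terms.items():
            index[term].append((doc_id, w))

    def search(query):
        # Accumulate dot-product scores by walking only the postings of
        # the query's terms, never touching unrelated documents.
        scores = defaultdict(float)
        for term, qw in query.items():
            for doc_id, w in index[term]:
                scores[doc_id] += qw * w
        return sorted(scores.items(), key=lambda p: -p[1])

    print(search({'paris': 1.0, 'france': 0.5}))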
Whoa, that was a lot. Thanks!