
Making Sense of Everything with words2map - lmcinnes
http://blog.yhat.com/posts/words2map.html
======
minimaxir
I'm not fond of the "magic AI does everything" narrative, especially since the
code is available on GitHub ([https://github.com/overlap-
ai/words2map](https://github.com/overlap-ai/words2map)) and it's not magic.
That being said, the code is optimized for efficient memory usage (important
with the pre-built word2vec models), and since it's MIT-licensed, I might be
able to develop a few pretty visualizations. :)

~~~
legel
Thanks for pointing the memory optimizations out. The Google model trained on
100 billion words unzips as a 3.4 GB binary, which often is a non-starter on
smaller servers, while these vectors indeed come in at less than 100 MB.
Mainly this is because only the most common 100k of the 3M distributed vectors
are included, but also because the 32-bit floats for each of the 300 elements
per vector are converted to 16-bit floats, with virtually no loss in precision.
(Research suggests we may be able to compress even further, perhaps to 5 bits
of entropy per vector element.) In practice, across a wide variety of words
and phrases, this pipeline can quickly derive new vectors by web searching,
and thanks to the OP's HDBSCAN, can cluster them quite well. Typically it's
the quality of keywords found online that ends up most limiting the quality of
maps derived here, while there's probably room for breakthroughs in the
keyword extraction algorithm put together here (which relies on word2vec
indexes as a proxy for the idf in tf-idf). In any case if greater precision is
needed - like when I wanted to map ~50 really great scientists who deserve all
the credit here - one trick that can help is to increase the number of
websites to scan per unknown word (e.g. from 10 to 50) in the
research_keywords function.
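
If it helps, here's roughly what that trimming step looks like with gensim's
KeyedVectors (file names and the 100k cutoff are illustrative, not exactly
what's in the repo):

    import numpy as np
    from gensim.models import KeyedVectors

    # Load only the 100k most frequent vectors from the 3M-word binary
    model = KeyedVectors.load_word2vec_format(
        "GoogleNews-vectors-negative300.bin", binary=True, limit=100000)

    # Downcast each 300-dimensional vector from float32 to float16,
    # roughly halving the memory footprint with little precision loss
    vectors_fp16 = model.vectors.astype(np.float16)

    # (attribute names follow recent gensim; older versions expose
    # syn0 / index2word instead)
    np.save("vectors_100k_fp16.npy", vectors_fp16)
    with open("vocab_100k.txt", "w") as f:
        f.write("\n".join(model.index_to_key))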

~~~
mdda
Embeddings can be compressed even further than that :
[https://arxiv.org/abs/1511.06397](https://arxiv.org/abs/1511.06397)
(disclaimer: I'm the author)

~~~
legel
Wow - 3 bits per element. Wonderful work with the quantization, I'd love to
implement this. How would you recommend I proceed?

~~~
mdda
The paper (with fewer typos) was actually accepted into 'ICONIP' in Japan in
October - so I'll definitely have code on GitHub by the end of the summer.
Currently my Theano implementation is buried in the usual exploratory-style
code, which just needs to be stripped away to make something that goes from
GloVe to sparse vectors in one command.

The NNSE paper has associated code already, but I found setting the sparseness
preference parameter was very hit-and-miss, which is why I preferred the
explicit sparse-by-percentage measure in my work.
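
To illustrate the explicit sparse-by-percentage idea (not the autoencoder from
the paper, just the target behaviour):

    import numpy as np

    def sparsify_by_percentage(vectors, keep_fraction=0.06):
        # Keep only the largest-magnitude keep_fraction of entries per row,
        # zeroing the rest - an explicit sparsity target rather than a
        # hit-and-miss sparseness-preference parameter
        k = max(1, int(round(keep_fraction * vectors.shape[1])))
        sparse = np.zeros_like(vectors)
        for i, row in enumerate(vectors):
            top = np.argsort(np.abs(row))[-k:]
            sparse[i, top] = row[top]
        return sparse

    # e.g. ~6% density on 300-d vectors keeps 18 non-zero entries each
    demo = sparsify_by_percentage(np.random.randn(4, 300))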

~~~
legel
~6% sparsity + the way that you think about data representation is very
interesting. I will have to take a shot at running the autoencoder over the
Mikolov 3M-word vocabulary. My goal is to get the first 1M words compressed to
under 100 MB zipped, including all indexes, since staying under 100 MB is what
currently lets us distribute the vectors for free on GitHub without paying for
data transfer.
(To date about 8000 words have been mapped through the words2map Google API,
while I haven't really begun to do anything interesting here yet...)
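
Rough arithmetic on why further quantization matters for that goal (ignoring
indexes and whatever zipping saves):

    # Uncompressed sizes for 1M words x 300 dimensions
    float32_mb   = 1_000_000 * 300 * 4 / 1e6        # ~1200 MB
    float16_mb   = 1_000_000 * 300 * 2 / 1e6        # ~600 MB
    three_bit_mb = 1_000_000 * 300 * 3 / 8 / 1e6    # ~112 MB at ~3 bits/element
    print(float32_mb, float16_mb, three_bit_mb)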

------
mdda
"We are now at a point in history when algorithms can learn, like people,
about pretty much anything. " seems pretty disingenuously worded.

One infers from a quick read something like "Algorithms are now like people,
and can learn about anything." But careful parsing of the commas shows that the
sentence is only true in the precise sense that "People can learn about
anything. Now, algorithms can also learn about anything" - and the extent of
learning/understanding is not being compared.

Perhaps I'm nit-picking, but this statement appears to have been constructed
to support an AI pitch, and is literally true, but no 'actual AI' is involved
(and no-one is actually claiming it is... unless you /want to believe/).

~~~
legel
The message is genuine and simple: computers can now learn like humans.

~~~
mdda
It is a genuine message. It is also simple. But it isn't actually true.
Computers are not able to learn like humans: the learning process is entirely
different. And what they learn is different - even though it can be visualized
in a way that makes it look 'humanlike'.

These techniques are impressive, and yhat is demonstrating that they are very
capable. It's just that I feel a little sad that the 'AI pitch' is being
turned on, when the 'really good tech' is a much more valid way to understand
what they're doing.

~~~
legel
Thanks for the 'really good tech' comment, I'll take that on HN any day. I'm
sure we could agree on many structural differences in human and machine
learning, and in that sense, I'd agree that it makes sense to be clearer about
them.

------
ilyaeck
Question to Y-hat folks: why cluster in 2D? Granted, clustering in 300D is
hard :) Still, the 2D projection must add a significant metric distortion. Why
not a middle ground, say, 5-10D ?

~~~
minimaxir
Unless you are Mister Mxyzptlk, you cannot plot visualizations greater than
2-3D.

~~~
sixhobbits
parent is asking about this I think: "This is indeed nice for data
visualization, while it’s also very helpful in our pipeline because it removes
noise in the derived vectors, by forcing a new mapping based purely on
relative similarity. For this reason we will be using the low-dimensional
coordinates of each word in our recommender system."

i.e. it's nice for visualisation, and it removes noise. It would be
interesting to see discussion from y-hat about where the sweet spot is between
removing noise and still keeping relevant information. I think because the
subject matter is pretty simple to cluster, 2D works well enough and keeps
everything simple.
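
For what it's worth, comparing projection targets is cheap to try with
scikit-learn's t-SNE (a sketch on dummy 300-d vectors; Barnes-Hut only supports
2-3 output dimensions, so higher targets need the exact method):

    import numpy as np
    from sklearn.manifold import TSNE

    vectors = np.random.randn(2000, 300).astype(np.float32)  # stand-in for word vectors

    # 2D, as used for the plots (fast Barnes-Hut approximation)
    coords_2d = TSNE(n_components=2, random_state=0).fit_transform(vectors)

    # A middle-ground 5D projection; Barnes-Hut only handles n_components <= 3,
    # so the slower exact method is required here
    coords_5d = TSNE(n_components=5, method="exact",
                     random_state=0).fit_transform(vectors)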

------
vinchuco
Nitpicking: it's NOT human + robot ≈ cyborg BUT average(human, robot) ≈ cyborg

Some things that come to mind:

I'd be interested to see other vector operations in the examples, such as
projecting one word onto another. Also, only nouns appear so far.

How is ≈ defined, if the distance to the closest word vector is not
necessarily unique?

Finally, what is the proportion of words that maintain human meaning when
averaged to those that are nonsense? What are the most "meaningful" words, in
that sense?
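
For concreteness, the averaging and nearest-word lookup can be sketched with
gensim (I don't know if words2map does exactly this; ≈ presumably just means
the closest vocabulary vector by cosine similarity, which need not be unique):

    from gensim.models import KeyedVectors

    model = KeyedVectors.load_word2vec_format(
        "GoogleNews-vectors-negative300.bin", binary=True, limit=100000)

    # Average the two vectors rather than summing them, then find the
    # nearest words in the vocabulary by cosine similarity
    average = (model["human"] + model["robot"]) / 2.0
    print(model.similar_by_vector(average, topn=5))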

------
vonnik
how is this different than TSNE?

[https://lvdmaaten.github.io/tsne/](https://lvdmaaten.github.io/tsne/)

anyone looking for an explanation of word2vec may find this helpful:

[http://deeplearning4j.org/word2vec](http://deeplearning4j.org/word2vec)

~~~
lmcinnes
It uses t-SNE but there are other working parts here, including word2vec
(and some nice compression of a pre-trained model), keyword searching to
provide context for terms, and clustering to find natural dense groups. It's a
nice pipeline that fits together a bunch of independently interesting parts
into a single system that can produce quite remarkable results.

------
ganeshkrishnan
Hi, I was in the middle of creating "user personalities" using K-means
clustering.

Is it ok to reference your document in our papers? The MIT licence is awesome
and lets us reuse your tech. Our site is at www.shoten.xyz if you are
interested in what we are doing.

~~~
legel
Thanks, shoten looks wonderful and exciting. You're welcome to reference/reuse
words2map as you wish, while the authors of word2vec, t-SNE, HDBSCAN, and
others deserve all the credit. :)

~~~
ganeshkrishnan
Thanks for the kind words. We have a few known unknowns and, I am sure, plenty
of unknown unknowns. What we are trying to implement is ranking based on the
"personality cluster" of the person querying. So in your example, a search by
Pablo Picasso would get slightly different results than one by Kanye West
(because they fall in different personality clusters).

~~~
legel
That sounds awesome, and technically achievable in various capacities.
Something a colleague thought of recently that may be of interest is
hierarchical mapping, particularly if the number of personalities is large -
so, e.g., having a map for all users, then unique maps for each user, and
programmatically transitioning across maps to the extent that "zooming in" is
useful. Kind of fun to think about.

The other thing that comes to mind is the ironic dependency of words2map on
the elephant in your room: Google. In particular, when you dive into the code,
you'll see that the first searches are free, but they become $5 / 1000 at
scale, which is probably not good for your objective. Therefore you may wish
to replace the Google search functionality of words2map with your own search
engine. If you explore doing so, I'd love to see the results, and would be
happy to incorporate it as an option for words2map users looking for a
completely free implementation. Just have a look at the research_keywords
function if this is of interest. And best of luck!

~~~
ganeshkrishnan
Pretty close to what I am trying to do; hierarchical "soft/overlapping"
clustering is what I am investigating. I have the design ready for matching
the user cluster to the corpus cluster. Then I use Okapi BM25 to get primary
results from this cluster and rank them (with an algorithm still to be
decided).

Then, as a second step, I augment the rank (increase or decrease) of a subset
of the top results using predetermined queries that match these results.

For example, given two otherwise equal documents and the search query "Who is
Tommen", if the second document gets more clicks, then I increase its rank
(by a function of how many more people prefer the second document).
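
Roughly, assuming the rank_bm25 package and a placeholder click-boost (the
real boost function is still undecided, as I said):

    from rank_bm25 import BM25Okapi

    documents = [
        "Tommen Baratheon is the young king on the Iron Throne",
        "Tommen is the younger brother of Joffrey and son of Cersei",
    ]
    clicks = [120, 310]  # hypothetical click counts per document

    bm25 = BM25Okapi([doc.lower().split() for doc in documents])
    base_scores = bm25.get_scores("who is tommen".split())

    # Step two: boost each BM25 score by a simple function of click preference
    total = sum(clicks)
    adjusted = [s * (1 + c / total) for s, c in zip(base_scores, clicks)]
    ranking = sorted(zip(documents, adjusted), key=lambda x: -x[1])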

~~~
legel
This is an interesting approach. The idea of overlapping clusters is
appealing; you're probably familiar with the LDA family of algorithms, which
provides nice data structures for this.
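
E.g., a toy gensim LDA sketch of that soft-assignment behaviour (purely
illustrative):

    from gensim import corpora
    from gensim.models import LdaModel

    texts = [
        ["robot", "cyborg", "machine", "learning"],
        ["painting", "picasso", "cubism", "art"],
        ["music", "kanye", "album", "art"],
    ]
    dictionary = corpora.Dictionary(texts)
    corpus = [dictionary.doc2bow(t) for t in texts]

    lda = LdaModel(corpus, num_topics=2, id2word=dictionary, random_state=0)

    # Each document gets a distribution over topics rather than one hard
    # cluster assignment - the "soft/overlapping" behaviour in question
    for bow in corpus:
        print(lda.get_document_topics(bow))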

------
sixhobbits
human + robot ≈ cyborg

electricity + silicon ≈ solar cells

virtual reality + reality ≈ augmented reality

--

These always seem impressive in word vector models, but in reality, I imagine
that "robot" and "cyborg" were already pretty close. The fact that adding
"human" nudged the vector closer is likely not as meaningful as it would be
nice to believe. The same for "electricity/solar cells" and "virtual
reality/augmented reality"

Still a really nice application for word2vec, and I'm looking forward to
seeing other similarly practical implementations in the future.

~~~
crypto5
They demonstrated a few somewhat meaningful results generated by the model,
but it would be interesting to know how much garbage it also contains.

------
visarga
I think you can also get pretty good suggestions with plain old bag-of-words,
tf-idf and k-means.

~~~
ganeshkrishnan
Bag-of-words doesn't give us suggestions, while tf-idf is for searching a
corpus, not clustering.

K-means is clustering and similar to this, correct.

