Why do we use word embeddings in NLP? (medium.com)
56 points by data_nat 12 days ago | 35 comments





After watching Andrew Ng's course on NLP, it finally all clicked for me. Basically, every word can be numerically described with up to 300 properties, some properties being more important. So every word is a 300-dimensional vector and vector algebra can be applied. So if you have the word-vector 'king' and you subtract the one-dimensional 'manliness' vector from it, the result is the word-vector 'queen'. Likewise, if you subtract 'queen' from 'king', the resulting vector is the one-dimensional vector 'manliness'. Awesome!
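For anyone who wants to poke at this, here's a minimal sketch with gensim, assuming you have a pretrained word2vec file lying around (the file name below is just a placeholder):

  # Minimal sketch of the king - man + woman analogy with gensim.
  # Assumes a pretrained word2vec file, e.g. the GoogleNews vectors.
  from gensim.models import KeyedVectors

  # Placeholder path -- substitute whatever pretrained vectors you have.
  vectors = KeyedVectors.load_word2vec_format(
      "GoogleNews-vectors-negative300.bin", binary=True)

  # Vector arithmetic: king - man + woman -> nearest neighbours.
  print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
  # 'queen' typically shows up near the top of the list.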

> Basically, every word can be numerically described with up to 300 properties, some properties being more important

I'd give a bit more of a nuanced view here -- we can choose any number of properties (dimensions) to represent words, which are all learned from a corpus. 300 dimensions is a pretty popular choice. These dimensions aren't (generally) interpretable: they represent latent properties. In other words, it's not possible to say which property each dimension represents, it's simply one that your word embedding algorithm has picked up in the data. Generally speaking, feature importance is hard to define for the same reason.


Imagine that we made word vectors out of PCA-reduced sparse tf-idf or count-vectorized vectors. I could tell you exactly what each PCA component explains. I could even do that at the word level, because it's not difficult to do inverse transforms with simple dimensionality reduction techniques.

The model interpretability goes out the window because we used vectorization techniques that kinda suck. NLP is unnecessarily obsessed with self-supervision when the field should be innovating in dimensionality reduction techniques.
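For concreteness, a rough sketch of the tf-idf + reduction approach I mean, with scikit-learn (TruncatedSVD stands in for PCA because the tf-idf matrix is sparse; the corpus is a toy placeholder):

  # Rough sketch: dimensionality reduction over tf-idf vectors where each
  # component can be inspected against the original vocabulary.
  from sklearn.feature_extraction.text import TfidfVectorizer
  from sklearn.decomposition import TruncatedSVD

  docs = [                      # toy placeholder corpus
      "the king ruled the land",
      "the queen ruled the land wisely",
      "the river bank flooded in spring",
      "the bank approved the loan",
  ]

  tfidf = TfidfVectorizer()
  X = tfidf.fit_transform(docs)

  svd = TruncatedSVD(n_components=2)
  X_reduced = svd.fit_transform(X)

  # How much variance each component explains...
  print(svd.explained_variance_ratio_)

  # ...and each component's loadings over the original words, which is what
  # makes the components inspectable word by word.
  vocab = tfidf.get_feature_names_out()
  print(sorted(zip(svd.components_[0], vocab), reverse=True)[:5])

  # Inverse transform takes reduced vectors back to (approximate) tf-idf space.
  X_approx = svd.inverse_transform(X_reduced)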


Why do you think NLP practitioners are focusing on self-supervision instead of dimensionality reduction?

I agree, and I have an idea for this kind of dimensionality reduction that makes the original unsupervised word vectors interpretable.

It boggles my mind that I haven't seen anyone implement my idea.


SVD has been used for dimensionality reduction of co-occurrence matrices for ages [1], but the resulting word embeddings aren't as performant as those of word2vec/etc. The same is probably true of using PCA.
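For reference, the classic recipe looks roughly like this (toy, hypothetical co-occurrence counts, plain numpy SVD):

  # Toy example of the classic approach: SVD on a word-word co-occurrence matrix.
  import numpy as np

  words = ["king", "queen", "man", "woman"]
  # Hypothetical co-occurrence counts (rows/columns follow `words`).
  C = np.array([[0, 5, 8, 1],
                [5, 0, 1, 9],
                [8, 1, 0, 4],
                [1, 9, 4, 0]], dtype=float)

  # Truncated SVD: keep the top-k singular vectors as word embeddings.
  U, S, Vt = np.linalg.svd(C)
  k = 2
  embeddings = U[:, :k] * S[:k]   # one k-dimensional vector per word

  for w, v in zip(words, embeddings):
      print(w, v)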

Word2vec's popularity is the result of people valuing performance (i.e. accuracy) more than interpretability.

[1] https://dl.acm.org/citation.cfm?id=148132


Well, it might be because it's hard to read your mind from here.

No, that wouldn't be mind-boggling.

What's mind-boggling to me is that I haven't seen anyone else come up with the idea independently.


It's so obvious one wouldn't have to read my mind; it's all implicit in the king - man + woman = queen type of relations... If you really ask a second time: fuck it, I'm not in the ML sector, perhaps I'll just give away the idea...

I agree. I guess Andrew chose those examples to better illustrate what those properties could represent.

Except this doesn't work consistently because words in real language are ambiguous and not every combination results in something that can meaningfully map to real-world language.

You need to start from a word list with 100% unambiguous and clearly defined words, and even then you're no step closer to working with real language because, while superficially similar, your word list is actually a highly specialised DSL.

Of course in many cases this DSL approximation of the target language is good enough for certain tasks but the entire process is inherently flawed.


I'm afraid you misunderstand the way embeddings work - at least for BERT-based models, which are currently state of the art.

BERT embeddings, after training, change with context. In other words, if you feed in a paragraph about bank robbers and look at the encoding for "bank", it will be meaningfully different from the encoding for the same word produced from a paragraph (or sentence) about river banks.

We use BERT at the startup I work at, and one of our tests was the sentence "the bank robbers robbed the bank and then rested by the river bank". BERT was able to generate three different, semantically meaningful encodings for the word "bank" in this sentence. The first two instances were much closer to each other in vector space (euclidean distance) than to the last.
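A rough reproduction of that test with the HuggingFace transformers library (not our actual code, just a sketch):

  # Sketch: contextual embeddings for the three occurrences of "bank".
  import torch
  from transformers import AutoTokenizer, AutoModel

  tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
  model = AutoModel.from_pretrained("bert-base-uncased")

  sentence = "the bank robbers robbed the bank and then rested by the river bank"
  inputs = tokenizer(sentence, return_tensors="pt")

  with torch.no_grad():
      outputs = model(**inputs)
  hidden = outputs.last_hidden_state[0]          # (seq_len, 768)

  tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
  bank_vecs = [hidden[i] for i, t in enumerate(tokens) if t == "bank"]

  # The two "robbery" banks should sit closer together than either is to the river bank.
  print(torch.dist(bank_vecs[0], bank_vecs[1]))  # smaller
  print(torch.dist(bank_vecs[0], bank_vecs[2]))  # larger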

This is huge, because it is arguably the first step in building an AI which can perform basic reasoning about information encoded in text. For example, if you average up the encodings of a paragraph of words, you can create an "encoding" which assigns a summary meaning or topic. Simple vector math becomes a powerful reasoning tool.

The future is here.


> This is huge, because it is arguably the first step in building an AI which can perform basic reasoning about information encoded in text.

Well, except for the many, many decades of previous work on NLP using symbolic methods that are quite capable. Although DNNs are en vogue and have some amazing properties, we shouldn't forget that symbolic AI/NLP using explicitly semantic representations is powerful and has a rich history, and complements DNNs quite well -- such as being easily explainable, for one.


The contextualized word embeddings you get out of BERT are still generated from fixed per-word vectors. And while you get one output vector for each input vector, that doesn't mean they correspond to each other. The model could arbitrarily reshuffle information between outputs, so long as the output as a whole reflects the input sufficiently well. So BERT embeddings are not "word embeddings" in the usual sense.

> approximation of the target language is good enough for certain tasks but the entire process is inherently flawed

That is pretty much the definition of a "model" :)

I recently went through the "Tensorflow in Practice" specialization on Coursera and it was illuminating. The thing about ML models, whether CNNs for images, or word2vec+RNN, or whatever else, is that they really don't have any rigid scientific basis for why they work. You're doing, say, Stochastic Gradient Descent to optimize the neuron weights across your dataset. Out the other side of the training, you have a mostly meaningless set of coefficients that work well to classify other unseen data.

I dual-majored in CS and EE, and I leaned towards the "science" side of things, where things get modelled mathematically and analyzed, accepting that the model is likely incomplete but still useful. The thing that drives me nuts with ML is that there's no explanation of what the terms in the ML model actually mean (because the process that produced them doesn't actually investigate meaning, it just optimizes the terms). But... I've accepted that even though the models are pretty much semantically meaningless, they work.


> I've accepted that even though the models are pretty much semantically meaningless, they work.

Until they don't. Which may happen easily if you deploy a model for the first time.

My personal view is that (at this moment) ML is mostly correlation detection and pattern recognition, but has little to do with intelligence.


The point is that we don't have the mental capacity to understand this stuff. Nobody has any clue how to interpret millions of dimensions, some non-linear manifold there, and how to translate it into something humans are capable of understanding. These things might be done automatically by our brains at a subconscious level in a similar fashion (or not), but at a conscious level we are completely clueless and basically throw darts to see which ones become somewhat useful.

I think you object to the lack of "mathematical beauty", but my point is "who cares?". Not sure why reality should conform to some mental model we find "appealing" for whatever reason. Deep learning is similar to experimental physics.


This.

Explainable AI is an emerging field; I hear about this necessity especially in NLP and law. We expect to understand how some decision was reached, and we'll never accept a computer-generated decision if it isn't explained how each logical step was taken. And just giving the millions of weights of each neuron won't give us that, because we won't be able to reach the same decision with just those parameters.

We know that AI is a bunch of probabilities, weights and relations in n dimensions. Our rational brain can know that too, but can't feel it.


That's why you use interpretability tools like LIME.

An example of this would be here: https://github.com/Hellisotherpeople/Active-Explainable-Clas...
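For text classifiers the usual LIME pattern looks something like this (the tf-idf + logistic regression pipeline and toy data below are just stand-ins for whatever model you actually want to explain):

  # Sketch of LIME on a text classifier.
  from lime.lime_text import LimeTextExplainer
  from sklearn.feature_extraction.text import TfidfVectorizer
  from sklearn.linear_model import LogisticRegression
  from sklearn.pipeline import make_pipeline

  train_texts = ["great movie", "terrible movie", "loved it", "hated it"]  # toy data
  train_labels = [1, 0, 1, 0]

  clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
  clf.fit(train_texts, train_labels)

  explainer = LimeTextExplainer(class_names=["negative", "positive"])
  exp = explainer.explain_instance("loved this terrible movie",
                                   clf.predict_proba, num_features=4)
  # Per-word contributions to the predicted class:
  print(exp.as_list())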


Wait... That's what dimensionality reduction is for. I can interpret 3 PCA dimensions pretty damn well, since I can figure out how much of the variance in my original dimensions each component explains.

Yeah, but your accuracy also drops. You might end up with an interpretable but underwhelming solution instead of a non-interpretable SOTA one.

As you hint at, the (more common) alternative to defining the world in advance is to rely on machine learning to figure things out (and possibly generate separate clusters for meanings that don't even map well to specific words but resolve the ambiguity). But even then you can run into problems. Even if your model can parse "cloud" correctly based on the context, good luck trying to parse a text about online storage of meteorological data.

The ontological approach described in the article doesn't really work all that well with real world data.

The raw ML approach works well enough but has a multitude of problems (e.g. learning biases, like "black" being a negative sentiment classifier when talking about people because of the texts the model was initially fed).

But given how hard it is to "solve" these problems, I'm not convinced ML alone will ever progress beyond the 80% "good enough" solution it is now, without being replaced with something completely different.

This is what makes me skeptical of all the tall tales about strong AI and "the singularity". While the specialised applications (e.g. deepfakes) are certainly impressive and a lot of the more generalised applications can go a long enough way to get a decent amount of funding despite unfixable flaws (e.g. sentiment analysis), getting from "here" to "there" seems to require more than just more incremental refinement.

Computational linguistics courses have been teaching ontological, "scientifically sound" approaches that yielded no real-world applications while Google had been eating their lunch with dumb statistical models. The dumb models have since become infinitely more intricate and improved from "barely usable" to "good enough" but seem to be inching ever closer to an insurmountable wall, whereas the "scientific" models still seem to be chasing their own tail describing spherical cows in a vacuum.


Well, science includes the scientific method. The trial and error that is abundant in deep-NN ML research is at the core of science. The theories and math have almost always arrived after stuff worked heuristically.

That's also one of the reasons why deep NNs returned the spark to ML. ML was so deep into proven models and math that the lack of a trial-and-error component slowed down progress.


I agree:

  king - manliness
is nonsensical in the context of chess.

That's a really good point, and related to my sibling comment, one of the other things that makes me uncomfortable about ML models is the mysterious generalizability. Some generalize well, some start spewing nonsense when you move slightly outside of their training zone. Special-purpose models, I think, do quite a bit better than attempted larger generalized models.

For your chess example, if you trained a word2vec model using only a large corpus of text about chess, you very likely wouldn't get the "king - manliness" vector being anything meaningful at all, but you would likely see word associations that are meaningful and also potentially unexpected.
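Training one on a domain corpus is only a few lines with gensim; a sketch with a hypothetical (and far too tiny) chess corpus:

  # Sketch: train word2vec on a domain-specific corpus (hypothetical chess text).
  from gensim.models import Word2Vec

  # Each sentence is a list of tokens; a real corpus would be much larger.
  chess_sentences = [
      ["the", "king", "is", "in", "check"],
      ["castle", "early", "to", "protect", "the", "king"],
      ["the", "queen", "is", "the", "strongest", "piece"],
  ]

  model = Word2Vec(chess_sentences, vector_size=50, window=3, min_count=1)
  # Associations reflect the chess corpus, not gender semantics.
  print(model.wv.most_similar("king", topn=5))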


What you describe is word2vec, which is one kind of word embedding that is pre-learned. Often (with Keras at least), the embedding is learned simultaneously with the deep learning network: Keras uses word indices to progressively adjust a set of randomly initialized n-dimensional vectors. I'm telling you because I feel it may be the case for you too; the various kinds of embedding were very confusing to me at first.
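In Keras that looks roughly like this (the vocabulary size, dimensions, and the classification head are placeholders):

  # Sketch: an embedding learned jointly with the rest of the network in Keras.
  from tensorflow.keras import Sequential
  from tensorflow.keras.layers import Embedding, GlobalAveragePooling1D, Dense

  vocab_size = 10000   # size of the word-index mapping (placeholder)
  embed_dim = 128      # dimensionality of the learned vectors (placeholder)

  model = Sequential([
      Embedding(input_dim=vocab_size, output_dim=embed_dim),  # starts random, trained with the net
      GlobalAveragePooling1D(),
      Dense(1, activation="sigmoid"),
  ])
  model.compile(optimizer="adam", loss="binary_crossentropy")
  # model.fit(padded_word_indices, labels, ...)  # indices come from your word-index mapping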

However, I just marveled at word2vec when I stumbled upon it. Encoding meaning as vector dimensions was mind-expanding for me.


These guys laid a lot of the groundwork for embeddings before word2vec while also showing practical applications in finance: https://www.elastic.co/blog/generating-and-visualizing-alpha...

It's also possible to initialise a word embedding matrix with vectors trained with word2vec (or any other type of pre-trained embeddings, fastText and GloVe being common) to get a perf boost. Of course, you have to ensure the embedding matrix is pre-populated with respect to the word-index mapping. Then, as you describe, these can be adjusted during training, although sometimes they're fixed. The boost is more noticeable when you're training a model on small amounts of data.
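Something like this, assuming a word-index mapping and a dict of pretrained vectors are already loaded (both are toy stand-ins here):

  # Sketch: seed a Keras Embedding layer with pretrained vectors.
  import numpy as np
  from tensorflow.keras.layers import Embedding

  embed_dim = 300
  word_index = {"king": 1, "queen": 2}                  # toy word -> id mapping
  pretrained = {"king": np.random.rand(embed_dim),      # stand-in for word2vec/GloVe lookups
                "queen": np.random.rand(embed_dim)}

  # Row i of the matrix holds the vector for the word with index i.
  embedding_matrix = np.zeros((len(word_index) + 1, embed_dim))
  for word, i in word_index.items():
      vec = pretrained.get(word)
      if vec is not None:               # words missing from the pretrained set stay zero
          embedding_matrix[i] = vec

  embedding_layer = Embedding(
      input_dim=embedding_matrix.shape[0],
      output_dim=embed_dim,
      weights=[embedding_matrix],
      trainable=False,                  # set True to fine-tune during training
  )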

For me, the marvel of word2vec is that we do not need to embed meaning into words explicitly. The move toward semantic understanding and meaning is not essential, but this post thinks a semantic representation is the next logical step.

The king, queen, man, woman example was in essence hand-crafted and deceptive, and this whole "arithmetic" property has now been debunked: https://twitter.com/goodfellow_ian/status/113352818965167718...

The exactness has been debunked, but the perplexity for the correct answer is very low...

Are there "recommended" pretrained image featurizers -- maybe intermediate layers from CNNs and all that?

I know of easy-to-use object recognition models, but for general clustering, metric learning, etc., it would be useful to have an abstract embedding.


BERT & GPT-2 are state-of-the-art pre-trained models designed with transfer learning in mind.

https://www.cv-foundation.org/openaccess/content_cvpr_worksh...

Here's a nice paper. But yes, almost any CNN works.

VGG-16 or VGG-19 seem to be used the most.


Yeah, but if I understood it correctly, the VGG-16 that comes with Keras is trained to give a probability distribution over 1000 interpretable labels. I was hoping for something that, like word2vec, embeds image data in a low-ish dimension.
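That 1000-way output is just the classification head; dropping it gives you roughly that kind of embedding. A sketch with Keras (the random array stands in for a real image batch):

  # Sketch: use VGG16 without its classification head as a generic image featurizer.
  import numpy as np
  from tensorflow.keras.applications import VGG16
  from tensorflow.keras.applications.vgg16 import preprocess_input

  # include_top=False drops the 1000-class softmax; pooling="avg" yields one 512-d vector per image.
  featurizer = VGG16(weights="imagenet", include_top=False, pooling="avg")

  images = np.random.rand(2, 224, 224, 3) * 255   # stand-in for a real image batch
  features = featurizer.predict(preprocess_input(images))
  print(features.shape)                           # (2, 512)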


