Sentence embeddings have a problem: the reason DALL-E 2 sometimes fails (medium.com/ozonetel-ai)
23 points by nutanc on June 25, 2022 | 9 comments



> Sentences like “People playing cricket with a tennis bat” and “People playing tennis with a cricket bat” have very high similarity scores even though we know that they are different sentences.

When I first looked up sentence embeddings several years ago, I was quite surprised how often the recommendation was to just take the mean of the word embeddings composing the sentence, as I thought that had to be too simple and would make exactly this kind of mistake. (Also, I just wasn't convinced that you could summarize a sentence's meaning by averaging the words it contains.) But this was before transformers, and I figured sentence embeddings had gotten more sophisticated by now, so I'm a little uncertain why modern, large models would have the same problem.
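
For what it's worth, plain mean pooling is order-invariant by construction, so the two bat/sport sentences literally collapse to the same vector. A minimal sketch with made-up word vectors, just to illustrate:

    import numpy as np

    # Toy word vectors, invented purely for illustration; in practice these
    # would come from a pretrained embedding table (word2vec, GloVe, ...).
    rng = np.random.default_rng(0)
    vocab = ["people", "playing", "cricket", "with", "a", "tennis", "bat"]
    word_vecs = {w: rng.normal(size=8) for w in vocab}

    def mean_pool(tokens):
        """Sentence embedding = plain average of its word embeddings."""
        return np.mean([word_vecs[t] for t in tokens], axis=0)

    s1 = "people playing cricket with a tennis bat".split()
    s2 = "people playing tennis with a cricket bat".split()

    # Same bag of words, different order -> identical mean-pooled vectors.
    print(np.allclose(mean_pool(s1), mean_pool(s2)))  # True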


The fact that we ultimately have nothing significantly better than these pooling techniques is one of the biggest letdown moments in AI for me, right up there with learning how the various language-model decoding techniques actually work.

The techniques we have for pooling or decoding are so infantile in comparison to the encoders it's actually pathetic.

On the other hand, it means we have lots of publication opportunities for anyone who can do better...
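
For anyone who hasn't looked under the hood, the decoding side really is that bare-bones. Here's a minimal greedy-decoding loop; the Hugging Face interface and the gpt2 checkpoint are just my choices for illustration:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Greedy decoding: at every step, take the single most likely next token
    # and append it. Checkpoint choice (gpt2) is only for illustration.
    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

    ids = tok("The bat flew out of the", return_tensors="pt").input_ids
    with torch.no_grad():
        for _ in range(10):
            logits = model(ids).logits          # (1, seq_len, vocab_size)
            next_id = logits[0, -1].argmax()    # argmax over the vocabulary
            ids = torch.cat([ids, next_id.view(1, 1)], dim=1)

    print(tok.decode(ids[0]))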


I think a lot of it comes down to how these things are trained. You can predict a word and then easily check whether it's right or wrong, so it's easy to design loss functions around "word things". It's a lot harder to design loss functions around the complex intricacies of word combinations: whether or not the network correctly predicts a sentence's meaning is much harder to calculate than whether or not it can predict a certain word. For me, the surprise is that so much sophistication pops out after training purely at the word level. However, I guess it's not surprising to find some limitations to that.

But if we did have a way to calculate loss directly on sentence meaning, then you could train a much more intelligent pooling function. In principle. I'm just not sure how that loss function would look. Maybe question answering tasks are part of the key.
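
One existing direction, and this is my addition rather than something from the article: contrastive objectives along the lines of Sentence-BERT/SimCSE compute the loss on whole-sentence vectors, treating paraphrase pairs as positives and the rest of the batch as negatives, so the gradient does flow through the pooling step. A rough sketch:

    import torch
    import torch.nn.functional as F

    def contrastive_sentence_loss(emb_a, emb_b, temperature=0.05):
        """
        emb_a, emb_b: (batch, dim) embeddings of paired sentences, e.g.
        paraphrases. Row i of emb_a should match row i of emb_b; every other
        row in the batch serves as an in-batch negative.
        """
        a = F.normalize(emb_a, dim=-1)
        b = F.normalize(emb_b, dim=-1)
        sim = a @ b.T / temperature          # (batch, batch) cosine similarities
        labels = torch.arange(a.size(0))     # true pairs sit on the diagonal
        return F.cross_entropy(sim, labels)

    # The embeddings would come from your pooling function on top of the
    # encoder, so the pooling itself gets trained directly on sentence pairs.
    loss = contrastive_sentence_loss(torch.randn(8, 256), torch.randn(8, 256))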


It's a pretty big step from "high similarity scores" to concluding that this is the reason for the failure. "High" is relative in this context, and the example sentences are arguably more similar to each other than to most other sentences (the definition of similarity is another topic). From the vector similarity alone, we don't know much about the actual embeddings of the respective sentences, nor about how they affect the model as a whole. Obviously something goes wrong for the given examples, but I don't think you can pin the problem down to one particular layer in a huge model like DALL-E.


The embeddings provide the guidance for the diffusion model, so they have a direct influence. Similarity of the sentences means they land in roughly the same region of the high-dimensional space, and when they are used to guide the diffusion model, the images come out wrong. One caveat: the experiments I have done are on sentence-transformer models; DALL-E 2 uses its own text embedding, afaik.
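
The check itself is easy to reproduce with an off-the-shelf sentence-transformer. The particular checkpoint below is my choice and the exact score varies by model, but the pair from the article typically scores very high:

    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")  # checkpoint chosen arbitrarily

    sents = [
        "People playing cricket with a tennis bat",
        "People playing tennis with a cricket bat",
    ]
    emb = model.encode(sents, convert_to_tensor=True)

    # Cosine similarity between the two sentences: same words, swapped roles.
    print(util.cos_sim(emb[0], emb[1]).item())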


Sentence embeddings have precisely the limitations you'd expect for syntax-only correlative models, which is what they are.

We forget this at our peril.

Sometimes we also forget that semantics exist, but "you can't argue with a zombie" was already posted here recently…


Is using a single embedding sufficient to capture the entire meaning of a sentence?

High similarity scores between sentences are not the whole picture. A sentence and its negation/opposite are naturally very similar in topic and sentence structure. The mapping from text embedding to image needs to be aware of the most important dimensions of the embedding when constructing the output image.
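
The same kind of quick check works for the negation case, again with an off-the-shelf sentence-transformer chosen arbitrarily; the exact score isn't guaranteed, but such pairs usually come out very similar:

    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")  # arbitrary checkpoint

    pair = ["The restaurant is open on Sundays",
            "The restaurant is not open on Sundays"]
    emb = model.encode(pair, convert_to_tensor=True)

    # A sentence and its negation share topic and structure, so the cosine
    # similarity tends to be high even though the meanings are opposite.
    print(util.cos_sim(emb[0], emb[1]).item())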


This is a really good point.


I am all for independent discovery, but the author should know that in the language-modeling community we have known this for years [1]. The authors of DALL-E are well aware of it. There are many concerted efforts to try to solve it. Heck, we even have whole conferences (SemEval, *SEM) dedicated to a host of known issues beyond negation (try adding quantification (every/some) or quantity (1, 2, 3, one, few, many) to the prompts) that we know, both theoretically and empirically, just don't work with our current embeddings (see [2] for a great overview; sorry, it's paywalled).

[1] https://aclanthology.org/2020.acl-main.698/
[2] https://www.annualreviews.org/doi/pdf/10.1146/annurev-lingui...



