Show HN: Wordllama – Things you can do with the token embeddings of an LLM (github.com/dleemiller)
365 points by deepsquirrelnet 3 days ago | 36 comments
After working with LLMs for long enough, I found myself wanting a lightweight utility for doing various small tasks to prepare inputs, locate information and create evaluators. This library is two things: a very simple model and utilities that run inference with it (e.g. fuzzy deduplication). The target platform is CPU, and it’s intended to be light, fast and pip installable — a library that lowers the barrier to working with strings semantically. You don’t need to install PyTorch or any other deep learning runtime to use it.

How can this be accomplished? The model is simply token embeddings that are average pooled. To create it, I extracted token embedding (nn.Embedding) vectors from LLMs, concatenated them along the embedding dimension, added a learnable weight parameter, and projected them to a smaller dimension. Using the sentence transformers framework and datasets, I trained the pooled embeddings with multiple negatives ranking loss and matryoshka representation learning so they can be truncated. After training, the weights and projections are no longer needed, because there are no contextual calculations. I run inference over the entire token vocabulary and save the new token embeddings to be loaded with numpy.
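In pseudocode, the inference path is just a lookup and an average. Here is a minimal numpy sketch of that idea (the array is a random stand-in for the saved, projected embedding matrix, and the function names are made up for illustration):

    import numpy as np

    # Minimal sketch of the inference path described above (names are illustrative).
    # In practice you would load the saved, projected nn.Embedding matrix instead.
    emb = np.random.randn(32000, 64).astype(np.float32)   # (vocab_size, dim) stand-in

    def embed(token_ids):
        # No attention or contextual computation: look up token vectors and average pool.
        pooled = emb[np.asarray(token_ids)].mean(axis=0)
        return pooled / np.linalg.norm(pooled)             # unit-normalize

    def cosine(a_ids, b_ids):
        # Cosine similarity between two pooled embeddings is then just a dot product.
        return float(embed(a_ids) @ embed(b_ids))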

While the results are not impressive compared to transformer models, they perform well on MTEB benchmarks compared to word embedding models (which they are most similar to), while being much smaller in size (the smallest model, with a 32k vocab and 64 dimensions, is only 4MB).

On the utility side, I’ve been adding tools that I think it will be useful for. In addition to general embedding, there are algorithms for ranking, filtering, clustering, deduplication and similarity. Some of them have a Cython implementation, and I’m continuing to benchmark and improve them as I have time. In addition to the “standard” models that use cosine similarity for some algorithms, there are binarized models that use hamming distance. This is a slightly faster similarity computation with significantly less memory per embedding (float32 -> 1 bit).
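Roughly, the binarized variant works like the sketch below (illustrative only; not necessarily how the library packs its bits):

    import numpy as np

    def binarize(vec):
        # float32 embedding -> 1 bit per dimension (sign), packed 8 bits per byte
        return np.packbits(vec > 0)

    def hamming(a_bits, b_bits):
        # number of differing bits between two packed embeddings
        return int(np.unpackbits(np.bitwise_xor(a_bits, b_bits)).sum())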

Hope you enjoy it, and find it useful. PS I haven’t figured out Windows builds yet, but Linux and Mac are supported.






Nice. I like the tiny size a lot; that's already an advantage over SBERT's smallest models.

It seems quite dated technically - which I understand is a tradeoff for performance - but can you provide a way to toggle between different types of similarity (e.g. semantic, NLI, noun-abstract)?

E.g. I sometimes want "Freezing" and "Burning" to be very similar (close to 1), say when grouping/clustering articles in a newspaper into categories like "Extreme environmental events", as on MTEB/Sentence-Similarity and as classic Word2Vec/GloVe would do. But if this were a chemistry article, I'd want them to be opposites, as ChatGPT embeddings would have it. And sometimes I want to use NLI embeddings to work out the causal link between two things. Because the latter two embedding types are more recent (2019+), they are where the technical opportunity is, not the older MTEB/semantic-similarity ones, which have been performant enough for many use cases since 2014 and received a big boost in 2019 with mini-lm-v2 etc.

For the above 3 embedding types I can use SBERT, but the dimensions are large, the models are quite large, and having to load multiple models for different similarity types is a strain on resources - it often takes about 6GB, because generative embedding models (or E5 etc.) are large, as are NLI models.


Great ideas - I’ll run some experiments and see how feasible it is. I’d want to see how performance is if I train on a single type of similarity. Without any contextual computation, I am not sure there are other options for doing it. It may require switching between models, but that’s not much of an issue.

It's a 17 MB model that benchmarks obviously worse than MiniLM v2 (which is SBERT). I run V3 on ONNX on every platform you can think of with a 23 MB model.

I don't intend for that to be read as dismissive; it's just important to understand work like this in context. Here, it's that there's a cool trick: if you get to an advanced understanding of LLMs, you notice they have embeddings too, and if that is your lens, it's much more straightforward to take a step forward and mess with those than to take a step back and survey the state of embeddings.


I assume that by "ChatGPT embeddings" you mean OpenAI embedding models. In that case, "burning" and "freezing" are not opposite at all, with a cosine similarity of 0.46 (running on text-embedding-3-large with 1024 dimensions). "Perfectly opposite" embeddings would have a similarity of -1.

It's a common mistake people make, thinking that words with opposite meanings will have opposite embeddings. Instead, words with opposite meanings have a lot in common: both "burning" and "freezing" are related to temperature and physics, they're both English words, they're both words that can be a verb, a noun and an adjective (not that many such words), they're both spelled correctly, etc. All these features end up being part of the embedding.
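If anyone wants to reproduce that number, something like this works (a sketch; it assumes an OpenAI API key in the environment, and the exact value may drift a little between runs and model versions):

    from openai import OpenAI
    import numpy as np

    client = OpenAI()  # assumes OPENAI_API_KEY is set

    resp = client.embeddings.create(
        model="text-embedding-3-large",
        input=["burning", "freezing"],
        dimensions=1024,
    )
    a, b = (np.array(d.embedding) for d in resp.data)
    # cosine similarity; roughly 0.46 per the parent comment
    print(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))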


This might be a dumb question but... if I get the embeddings of words with a common theme like "burning", "warm", "cool", "freezing", would I be able to relatively well fit an arc (or line) between them? So that if I interpolate along that arc/line, I get vectors close to "hot" and "cold"?

This was the original argument of the King-Queen-Man-Woman Word2Vec paper - and it turns out the answer is no beyond basic categories, or yes, but only to a degree. All embeddings are trained based on what the creator decides they want them to do: to represent semantic (meaningful) similarity - similar word use - or topics or domains - or level of language use - or indeed to work multilingually and clump together embeddings in one language, etc.

Different models will give you different results - many are based on search/retrieval, for which MTEB is a good benchmark. But those ones won't generally "excel" at what you propose; they'll just be in the same area.
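For the curious, the interpolation experiment is easy to try with classic word vectors; here is a sketch using gensim with GloVe as one arbitrary choice (contextual/LLM-based models will behave differently):

    import gensim.downloader as api
    import numpy as np

    kv = api.load("glove-wiki-gigaword-100")   # downloads the vectors on first use

    # Walk along the line between the two word vectors and look at nearest neighbours.
    for t in np.linspace(0.0, 1.0, 5):
        v = (1 - t) * kv["burning"] + t * kv["freezing"]
        print(round(float(t), 2), kv.similar_by_vector(v, topn=3))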


You are missing the forest for the trees in my point. LLM-based (especially RLHF) embeddings allow you to do much more and encode greater context than either "this thing is being used as a potent adjective" or "this thing is a noun similar to that other [abstraction] noun" <-- Word2Vec, or "this thing is similar in terms of the whole sentence when doing retrieval tasks" <-- SBERT.

If you can't see why it is useful that neither Word2Vec nor SBERT can put "positive charge" and "negative charge" in very different, opposite embedding space while LLM- and RLHF-based embeddings can, you don't understand the full utilization possible with embeddings.

Firstly, you can choose what you embed the word with, such as "Article Topic:" or "Temperature:" to adjust the output of the embedding and results of cosine similarity to be relevant for your use case (if you use a word-based embedding, which captures much less than a sentence for search/retrieval/many other tasks like categorising)

Secondly, by default these models are not in as "dumb" a state as the original slew of Word2Vec and GloVe, which yes, would score words like "loved" and "hated" as highly similar given their similar use as adjectives - which caused issues for things like semantic classification of reviews, etc. These models encode so much more that they see the difference between "loved" and "hated" as much bigger than that between "loved" and "walk", for example. *This is already a useful default step up, but most anyone using RLHF embeddings is embedding sentences to get the best use out of them*

Your understanding of embeddings is rather flawed if it focuses on "they're both English words, they're both words that can be a verb, a noun and an adjective (not that many such words)". Why do embeddings in different languages with the same semantic meaning land closer in space than two unrelated English words? The model has no focus on part-of-speech type, and is ideally suited to embedding sentences, where with every additional token it can produce a more useful embedding. The "spelled correctly" point suggests a misapprehension that these systems are a look-up - yes they are for one word, and if you spelled that one word wrong (or that one token which represents multiple words), you'd get a different, very wrong place in embedding space. However, when you have multiple tokens, a misspelling moves the embedding very little, because the model is adept at comprehending misspellings, slang and other "translation"-like tasks early, and making their effects irrelevant for downstream tasks unless they are useful to keep around. Effective resolution of spelling mistakes is possible with models as small as 2-5GB anyway, as T5 showed back in 2019, and I'd posit that even some sentence-similarity-trained models (e.g. based on BERT, whose training set contained some spelling errors) treat spelling mistakes essentially the same way.

I am aware of the options from OpenAI for embeddings, as I have used them for a long, long time. The original options were each based on the early released models, especially ada and babbage. Though the naming convention isn't clear any more, the more recent models are based on RLHF models, like ChatGPT, and hence I mention ChatGPT to make it clear to cursory readers that I am not referring to the older tier of OpenAI embedding models based on non-RLHF models.


The tone of your post is really strange and condescending; not sure why. You made a statement that I, in my work, very often see people make when they first start learning about embeddings (expecting words that we humans see as "opposite" to actually have opposite embeddings), and I corrected it, as it might help other people reading this thread.

> Firstly, you can choose what you embed the word with, such as "Article Topic:" or "Temperature:" to adjust the output of the embedding and results of cosine similarity to be relevant for your use case

As far as LLM-based embeddings go, unless you train the model for this type of format, this is not true at all. In fact, the opposite is true - adding such qualifiers before your text only increases the similarity, as those two texts are, in fact, more similar after such additions. I am aware that instruct-embedding models work, but their performance and flexibility are, in my experience, very limited.

As for the rest of your post, I really don't see why you are trying to convince me that LLM-based embeddings have so much more to them than previous models. I am very well aware of this - my work revolves around such new models. I simply corrected a common misconception that you gave, and I don't really care if you "really think that" or if you know what the truth is but just wrote it as an off-hand remark.


Saying "perfectly opposite" does not need to mean the mathematical cosine similarity would be -1. The point you implied by bringing up this irrelevant information was to be dismissive of the relevance of generative-model embeddings for different tasks (and 0.41 is less similar than you get from previous embedding models, which don't have the rich context of LLM or RLHF models). This is why you got the snarky tone back: you took an unnecessarily literal interpretation, and revealed in your later paragraphs a dated attitude to embeddings that you tend to get from a surface-level understanding, i.e. that adjective, noun or other PoS type or presence is more important for similarity (e.g. adjectives are closer to each other in Word2Vec but NOT consistently so in generative embeddings).

Of course embeddings prefixed will be generally closer. You misunderstand the use case and are looking at embeddings in an outdated way. The point is this:

When I want to use embeddings to model newspaper articles, I put "Article:" in front of the topic as I embed it, and for that purpose they will suit my needs better. When I need to use embeddings for temperature or scientific literature purposes, I might put "Temperature:" in front of them, and "Burning"/"Freezing" will be further apart. That is useful in a way that Word2Vec, GloVe and, to a lesser degree, even SBERT cannot do.
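For concreteness, this is the kind of comparison being described. The sketch below uses sentence-transformers with MiniLM purely as a stand-in to show the methodology; whether the gap between the prefixed pair actually widens depends entirely on the embedding model:

    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")

    pairs = [("Burning", "Freezing"),
             ("Temperature: Burning", "Temperature: Freezing"),
             ("Article topic: Burning", "Article topic: Freezing")]

    for a, b in pairs:
        ea, eb = model.encode([a, b], normalize_embeddings=True)
        # compare raw vs prefixed cosine similarity
        print(a, "|", b, "->", round(float(util.cos_sim(ea, eb)), 3))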

The misconception you claim is based on Word2Vec and GloVe and is not true generally - words can have several senses (polysemy), as can phrases, so it's a difficult point to argue in the first place. When you say "words that have the opposite meaning will have opposite embeddings. Instead, words with opposite meanings have a lot in common", that is only true of embeddings from Word2Vec, GloVe and the early BART era, which are quickly falling out of fashion because they are limited. Your understanding is dated, and you see a misconception because you have failed to adequately explore or understand the possible use cases or representations viable with these embeddings. There is so much more! You can embed across languages. You can embed conversations!

As for your appeal to authority - I don't need to make such a claim. I'm sorry if you work in a job stuck in the past, trying to apply a pre-2020 understanding of NLP to 2024 models, but well, that sounds like your choice. To me, it sounds like you're assuming the past holds true and taking points absolutely; is that really wise in a fast-changing field? There have been several hackathons about embeddings. Try exploring the recent ones and look at what is really possible.


Which part of the "my work revolves around such new models" did you misunderstand? Claiming that:

> Of course embeddings prefixed will be generally closer.

and then later:

> When I need to use embeddings for temperature or scientific literature purposes, I might put "Temperature:" in front of them, and "Burning"/"Freezing" will be further apart.

is awesome. Have a good day!


“Of course embeddings prefixed will be generally closer” is true: they are closer in that “Temperature: Burning”, “Temperature: House”, and “Temperature: Freezing” will all be closer to one another than “Burning”, “House”, and “Freezing” are.

You seem confused again, because you are limited by prior beliefs about embeddings. The relative increase in distance between “Temperature: Burning” and “Temperature: Freezing” is the value, and what we want from generative embeddings.

It means that adding “Temperature:” allows us to differentiate more based on temperature here, whether or not prefixing “Temperature:” to everything puts them in closer space. I mentioned that only to elucidate that “prefixing text brings them closer” is an irrelevant counterargument when the relative distance increases more.

“Temperature: Burning”, “Temperature: House”, and “Temperature: Freezing” all being closer is irrelevant to that, because we work out the absolute cosine similarity between the examples we have and can use the extremes to cluster or do other work later on. The group being closer has no impact on the usefulness of the distance between “Temperature: Freezing” and “Temperature: Burning” for things like k-means clustering.

Let me clarify finally: my point was that embeddings released more recently allow you to do more interesting things and embed different word (and sentence) and other senses of content more meaningfully than the previously limited approaches of Word2Vec, GloVe etc. You are taking it as given that adjectives being used in similar sentence positions means they innately should have similar embeddings, and that I’m being stupid for missing that this would be so; what you aren’t seeing is that this is an artifact of the limited contextual representation that LLMs previous to GPTs could address, and of the fact that they were not modelling text in the same way (i.e. learned positional embeddings and context lengths of 1,000 vs single-sentence training samples with masked tokens). It was a limitation, not a positive property, of previous embedding regimes like Word2Vec to put adjectives with opposite semantic meanings in the same embedding space, and one that NLU researchers often used as a criticism of the field.

As to your point about the cosine similarity of “Burning” and “Freezing” not being “perfectly opposite” – I said ‘opposite’, not perfectly opposite, and I wasn’t referring to -1, which you have misinterpreted. Perfectly opposite has a different meaning to opposite. The claim that I want “Freezing” and “Burning” to be opposite is not that the cosine similarity would be -1, as that is obviously absurd for a transformer-trained model, and fairly obvious to anyone who works with embeddings, who will notice they tend to clump together. The claim is that, relative to other alternate embeddings, the cosine similarity should be much lower than it would be with the older models, especially if you contextually prefix the embedding (such prefixing was not that effective – or at least not consistently effective – with BERT, and for obvious reasons useless for a word-embedding regime, since you are just providing the same word vector to both comparisons, so the cosine distance won’t change). Which it is: the cosine similarity of “Freezing” and “Burning” is lower than with BERT, and the cosine similarity of “great disadvantage” and “great advantage” is lower too – which rather contradicts your universal claim that “as adjectives, they are close together!”. No, that’s just an artifact of older models and was more a limitation than a feature.

It is obvious to a mathematical mind who works with similarity measures regularly that two embeddings trained in a multi-sample unsupervised learning system, especially a transformer network, can never have a cosine of -1, whether you train on the minimum squared error of positive examples (very similar sentences) or with negative samples (pairs of dissimilar sentences, e.g. contrastive learning). You encounter this with PCA, you encounter this with the layers of neural networks, you encounter this with embeddings; it’s a regular property in machine learning. So, I don’t mean to be rude, but it’s clear you don’t have much of a deep understanding of machine learning if you are unaware of this; with sufficient experience with similarity measures and machine learning it is obvious – you never see an absolute extreme value, because of the pull of samples as you go through the training data. In fact, for many years (2008-2018) many practitioners struggled with issues like nodes stuck at 0 (inert values) until new approaches overcame that issue. Of course it is possible to have a -1 representation when simply manipulating matrices without training. Therefore, the fact that you took this claim literally told me you don’t understand transformer embedding architectures, and are probably unaware of how much embeddings have changed. Given you missed this in my original post, I hope it has become clear to you now.

In the limit, the majority of the last-encountered training samples were not the two samples in question, either because they were unencountered or because the training set has at least as many samples as tokens, and there are 56k-100k tokens depending on the LLM; within the last 5 samples (ignoring batches), at least 3 were not the two. Each sample has, on average, the same pull on the cosine-space weights within the network because of the normal distribution, and each training iteration necessarily affects all weights. The majority of iterations push the network’s weights in directions which may be similar to, but are not the same as, those pulling towards a cosine similarity of -1 for our two given samples, with the effect increasing with the size of the training set; therefore, we can say that a cosine similarity of -1 is not possible. So long as there are 5 unique pieces of text in a transformer training regime, and the random weights are not initialized to -1, there never will be a cosine similarity of -1. The same proof holds for PCA and basically most training regimes with distributed samples (not even a normal distribution is necessary) and any kind of weight-based learning of representations, as in geometric proofs.


Embeddings capture a lot of semantic information based on the training data and objective function, and can be used independently for a lot of useful tasks.

I used to use embeddings from the text encoder of a CLIP model to augment the prompt to better match corresponding images. For example, given a word like "building" in the prompt, I would find the nearest neighbors in the embedding matrix, like "concrete", "underground" etc., and substitute/append those after the corresponding word. This led to higher recall for most of the queries in my limited experiments!
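Something like the sketch below captures the idea, using the token embedding matrix of a CLIP text encoder from Hugging Face transformers (the model choice and attribute path are my assumptions, not necessarily what the parent used):

    import torch
    from transformers import CLIPTokenizer, CLIPTextModel

    name = "openai/clip-vit-base-patch32"
    tok = CLIPTokenizer.from_pretrained(name)
    text_model = CLIPTextModel.from_pretrained(name)

    # Token embedding matrix of the text encoder, shape (vocab_size, hidden_dim)
    emb = text_model.text_model.embeddings.token_embedding.weight.detach()
    emb = torch.nn.functional.normalize(emb, dim=-1)

    word_id = tok.encode("building", add_special_tokens=False)[0]
    sims = emb @ emb[word_id]                       # cosine similarity to every token
    top = sims.topk(6).indices.tolist()             # includes the word itself
    print([tok.decode([i]) for i in top])           # candidate tokens to append to the prompt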


Yup, and you can train these in-domain contextual relationships into the embedding models.

https://www.marqo.ai/blog/generalized-contrastive-learning-f...


That’s a really cool idea. I’ll think about it some more, because it sounds like a feasible implementation for this. I think if you take the magnitude of any token embedding in wordllama, it might also help identify important tokens to augment. But it might work a lot better if trained on data selected for this task.

Any plans for languages other than English? This would be a perfect tool for the French language.

It’s certainly feasible. I’d need to put together a corpus for training, and I’m not terribly familiar with what’s available for French.

I have done some training with the Mistral family of models, and that’s probably what I’d think to try first on a French corpus.

Feel free to open an issue and I’ll work on it as I find time.


Very interested in a multilingual version too!

FYI, Hugging Face hosts datasets too. And Wikipedia has a nice portal for datasets: https://en.m.wikipedia.org/wiki/List_of_datasets_for_machine...


With a large corpus (10,000+ sentences -- each sentence is a "document" in my use case) I can get similar results by k-means clustering TF-IDF sparse matrix vectors, but it looks like this has a lot of utilities for making the k-means part faster (binarization, etc).
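For reference, that baseline is only a few lines with scikit-learn (toy data and arbitrary parameters below):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    docs = ["each sentence is a document here",
            "another short sentence about clustering",
            "tf-idf vectors can be clustered with k-means"]

    X = TfidfVectorizer(stop_words="english").fit_transform(docs)   # sparse matrix
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    print(labels)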

Looking forward to doing some benchmarking over the next couple weeks


I wrote a set of "language games" which used a similar set of functions years ago: https://github.com/Hellisotherpeople/Language-games

Interesting... looks like this uses pymagnitude

https://github.com/plasticityai/magnitude


Has anyone thought of using embeddings to solve Little Alchemy? #sample-use

Looks like someone remade https://neal.fun/infinite-craft/

I thought it was the other way around: first Little Alchemy, then they used an LLM to create a better version of it.

That's possible - I may have assumed incorrectly.

Looks cool! Any advantages over the mini-lm model? It seems better on most MTEB tasks, but I'm wondering if maybe inference or something else is better.

Mini-lm is a better embedding model. This model does not perform attention calculations, or use a deep learning framework after training. You won’t get the contextual benefits of transformer models in this one.

It’s not meant to be a state of the art model though. I’ve put in pretty limiting constraints in order to keep dependencies, size and hardware requirements low, and speed high.

Even for a word embedding model it’s quite lightweight, as those have much larger vocabularies and are typically a few gigabytes.


Which ones do use attention? Any recommendations?

Depends immensely on use case — what are your compute limitations? are you fine with remote code? are you doing symmetric or asymmetric retrieval? do you need support in one language or many languages? do you need to work on just text or (audio, video, image)? are you working in a specific domain?

A lot of people choose models based purely on one or two benchmarks and wind up viewing embedding-based projects as a failure.

If you do answer some of those I’d be happy to give my anecdotal feedback :)


Sorry, I wasn’t clear. I was speaking about utility models/libraries for computing things like meaning similarity using not just token embeddings but attention too. I’m really interested in finding a good utility that leverages the transformer to compute “meaning similarity” between two texts.

Most current models are transformer encoders that use attention. I like most of the options that ollama provides.

I think this one is currently at the top of the MTEB leaderboard, but it has large-dimension vectors and is a multi-billion-parameter model: https://huggingface.co/nvidia/NV-Embed-v1


Looks like it's the size of the model itself - more lightweight and faster. Mini-lm is 80MB while the smallest one here is 16MB.

Mini-lm isn't optimized to be as small as possible though, and is kind of dated. It was trained on a tiny number of similarity pairs compared to what we have available today.

As of the last time I did it in 2022, mini-lm can be distilled to 40MB with only limited loss in accuracy, as can paraphrase-MiniLM-L3-v1 (down to 21MB), by reducing the dimensions by half or more and projecting through a custom matrix optimization (optionally including domain-specific or more recent training pairs). I imagine today you could get it down to 32MB (= project to ~156 dims) without accuracy loss.
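One simple way to do that kind of projection is a PCA fit on a sample of embeddings. A sketch of the idea (random data stands in for real MiniLM sentence embeddings, and 156 dims is just the figure mentioned above):

    import numpy as np

    emb = np.random.randn(10000, 384).astype(np.float32)   # stand-in for MiniLM embeddings

    # Fit the projection once on a sample, then reuse W on new embeddings at inference time.
    mean = emb.mean(axis=0)
    U, S, Vt = np.linalg.svd(emb - mean, full_matrices=False)
    W = Vt[:156].T                                          # (384, 156) projection matrix

    reduced = (emb - mean) @ W                              # (n, 156) reduced embeddings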


What are some recent sources for high quality similarity pairs?

This is great for game making! Thank you!

This shows just how much semantic content is embedded in the tokens themselves.

hmm ... postgresql extension?



