Embeddings: What they are and why they matter (simonwillison.net)
668 points by simonw on Oct 25, 2023 | 131 comments



Since publishing this I've found a few additional resources that are really useful for understanding embeddings at a lower level (my article is deliberately very high level and focuses mainly on their applications).

Cohere's Text Embeddings Visually Explained: https://txt.cohere.com/text-embeddings/

The Tensorflow Embedding Projector tool: https://projector.tensorflow.org/

What are embeddings? by Vicki Boykis is worth checking out as well: https://vickiboykis.com/what_are_embeddings/

Actually I'll add those as "further reading" at the bottom of the page.


I had exactly the same idea a while back:

https://blog.scottlogic.com/2022/02/23/word-embedding-recomm...

Using embeddings I increased engagement with related articles.

Personally I think embeddings are a powerful tool that is somewhat overlooked. They can be used to navigate between documents (and excerpts) based on similarities - or conversely to find unique content.

All without worrying about hallucinations. In other words, they are quite ‘safe’


> All without worrying about hallucinations. In other words, they are quite ‘safe’

Within limits, yes. In some use cases a vector notion of similarity isn't always ideal.

For example, in the article "France" and "Germany" are considered similar. Yes, they are, but if you're searching for stuff about France then stuff about Germany is a false positive.

Embeddings can also struggle with logical opposites. Hot/cold are in many senses similar concepts, but they are also opposites. Finding the opposite of what you're searching for isn't always helpful.

I wouldn't say embeddings are overlooked exactly? Right now it feels like every man and his dog is building embedding-based search engines. The next frontier is probably going to be balancing conventional word-based approaches with embeddings to really maximize result quality, as sometimes you want "vibes" and sometimes you want control.


Simon, just wanted to say thanks for all the great content and writings you've been putting out - it's been super helpful to help digest a lot of the fast developments in this space. Always looking forward to the next one!


Thanks for saying that!


Simon, the way you write makes it so accessible for people who have limited experience with AI, ML or LLMs. Thank you!

Maybe it would also be interesting to explain how some embeddings are created, e.g. via training a classifier and cutting off the classification layer, or with things like EfficientNet.


Did you stumble upon any resources discussing the history of embeddings and their use in CS and LLMs? They're becoming a cornerstone of ML.


Maybe someone can offer a richer history, but to my knowledge the first suggestion of word vectors was LSA, which originated the idea of dimensionality reduction on a term/doc matrix. They just used SVD, but the more modern methods are all doing essentially the same thing. To my recollection they were an HCI lab and their goal was not to make a language model so much as to make a search function for files.


Not aside from Word2Vec and I'd like to learn more about that.


Do you not consider SVD/PCA to be "embeddings"?

Latent Semantic Indexing long precedes word2vec (I believe the initial paper was 1988) but also attempted to derive semantic vectors (and was arguably successful) using SVD on the term-frequency matrix representing a corpus.

I would certainly consider this an example of using "embeddings" that was quite heavily used in practice prior to the explosion of deep learning techniques.
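
Here's a minimal sketch of that idea in modern terms, using scikit-learn (a TF-IDF term/doc matrix plus truncated SVD, which is essentially LSI; the toy corpus and dimension count are just illustrative):

  from sklearn.feature_extraction.text import TfidfVectorizer
  from sklearn.decomposition import TruncatedSVD
  from sklearn.metrics.pairwise import cosine_similarity

  docs = [
      "the cat sat on the mat",
      "a kitten rested on a rug",
      "stock markets fell sharply today",
  ]

  tfidf = TfidfVectorizer().fit_transform(docs)              # sparse term/doc matrix
  lsi = TruncatedSVD(n_components=2).fit_transform(tfidf)    # dense "semantic" vectors per doc

  print(cosine_similarity(lsi))   # pairwise document similarities in the reduced space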


Sounds like you know a great deal more about the history of this field than I do!


The classic Information Retrieval, though out of date in many ways, does a great job covering the "old school" NLP approaches for IR and has an excellent section on Latent Semantic Indexing which you might find enjoyable: https://nlp.stanford.edu/IR-book/html/htmledition/latent-sem...


There are a handful of historical components that come together to make word2vec such a success:

* The idea of vectorial representations for language.

* "Distributed" representations, which are dense not sparse.

* Vectorial representations for words, not documents.

* The language modeling idea of predicting the next word.

* Neural approaches to inducing these representations.

* Unsupervised learning of these representations.

* The versatility of these representations for downstream tasks.

* Fast training techniques.

I'll give the history in roughly reverse chronological order.

Turian, Ratinov, Bengio (2010) "Word representations: A simple and general method for semi-supervised learning" is my work. It received the ACL ten year test of time award and 3k citations. One main contribution was showing that unsupervised neural word embeddings can just be shoved into any existing model as features and get an improvement. This turned on the NLP community to neural networks at a time when sophisticated ML was still bleeding edge for NLP and expert linguistics knowledge was the preferred MO. [edit: We also showed that two other unsupervised embeddings, like Brown clusters which are very old, and neural word embeddings from the log-bilinear model (Mnih & Hinton, 2007), a probabilistic and linear neural model, also gave improvements when plugged into existing supervised models. We also gained attention because we released our code AND all the trained embeddings for people just to try, which wasn't commonplace in NLP and ML at the time.]

We arbitraged the neural embedding model from Collobert and Weston (2008) "A unified architecture for natural language processing: Deep neural networks with multitask learning" and also "Natural Language Processing Almost From Scratch" which achieved amazing scores on many NLP tasks using semi-supervised neural networks that, for the first time, were very fast to train because of the use of contrastive learning. This work didn't get much attention at the time because it was aimed at an ML audience and also because neural networks were still gauche compared to SVMs.

Collobert and Weston had a much faster training approach to the neural language model of Bengio et al 2000 "A neural probabilistic language model" which was, in my mind, what really precipitated all this: Train a neural network to predict the next word. That approach was slow because the output prediction was a multiclass prediction of output size # vocabulary words. (Collobert and Weston used Hadsell + LeCun siamese style networks to rank the next word with a higher score plus margin than a randomly selected noise word.)

With that said, vector embeddings for documents have a longer history: LSA, then LDA, and even cool semantic hashing work by Salakhuldinov + Hinton (2007) that is one of the first deep learning approaches (the first?) to NLP, which unfortunately didn't get much attention but was so cool.

Earlier work using neural networks for modeling arbitrary length context that also didn't get much attention was Pollack (1990) "Recursive distributed representations" which introduced recursive autoassociative memory (RAAM) and later Sperduti (1994) "Labelling recursive auto-associative memory". The idea was that you have a representation for the sentence and you recursively consume the next token to generate a new fixed-length representation. You then have a STOP token at the end. And then you can unroll the representation because current representation + next token => next representation is an autoassociator.

The compute power wasn't really there to make this stuff work empirically during the 1990s. But there was other fringe work like Chrisman (1990) "Learning Recursive Distributed Representations for Holistic Computation". And this 90's work traces to cool 80s Hinton conceptual work on "associative" representations, for example Hinton (1984) "Distributed Representations" and Hinton (1986) "Learning distributed representations of concepts". A lot of this work had very interesting critical ideas, and was of the form: I thought about this for a very long time and here's how it would work and we don't have large-scale training techniques yet.

I'm pretty sure Bottou also contributed here, but I'm forgetting the exact cite.

Feel free to email me if you like. (See profile.)


Thank you for this, most informative comment I've seen on Hacker News in ages!


Latent semantic indexing


nice article, thanks


Not quite the same application, but in computer vision and visual SLAM algorithms (which construct a map of your surroundings using a camera), embeddings have become a de facto method for place recognition! And it's very similar to this article. It is called "bag-of-words place recognition" and it really became the standard, used by absolutely every open-source library nowadays.

The core idea is that each image is passed through a feature-extractor-descriptor pipeline and is 'embedded' in a vector containing the N top features. While the camera moves, a database of images (called keyframes) is created (images are stored as much lower-dimensional vectors). Again while the camera moves, each new image is used to query the database, and something like cosine similarity is used to retrieve the best match from the vector database. If a match is found, stereo constraints can be computed between the query image and the match, and the software is able to update the map.
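
Just to make the retrieval step concrete, here's a toy sketch (assuming each keyframe has already been reduced to a single descriptor vector; the threshold is made up and a real BoW/DBoW2 pipeline does much more):

  import numpy as np

  def best_match(query, keyframes, threshold=0.8):
      db = np.stack(keyframes)                     # (num_keyframes, N) descriptor vectors
      sims = db @ query / (np.linalg.norm(db, axis=1) * np.linalg.norm(query))
      idx = int(np.argmax(sims))                   # most similar keyframe
      return (idx, sims[idx]) if sims[idx] >= threshold else (None, sims[idx])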

[1] is the original paper and here's the most famous implementation: https://github.com/dorian3d/DBoW2

[1]: https://www.google.com/search?client=firefox-b-d&q=Bags+of+B...


This is a great zero to one reference.

I built my own note-taking iOS app a little while back and adding embeddings to my existing fulltext search functionality was 1) surprisingly easy and 2) way more powerful than I initially expected.

I knew it would work for things like if I search for "dog" I'll also get notes that say "canine", but I didn't originally realize until I played around with it that if I search for something like "pets I might like" I'll get all kinds of notes I've taken regarding various animals with positive sentiment.

It was the first big aha moment for me.

At the time I found Supabase's PR for their DocsGPT really helpful for example code: https://github.com/supabase/supabase/pull/12056


I think your statement "adding to existing fulltext functionality" is subtly important: embeddings provide semantic search that complements traditional search algorithms.

Specifically, many applications are heavily dependent on names or other proper nouns, often without much context. You might refer to your dog by name without explanation, and a particular embedding might not pick that up. Proper names (people, places, street names) may have outsized importance for anchoring personalized or domain-specific search, but modest generic language models won't know about them.

Is there a specific way of dealing with this problem?


Just spitballing here but you could maybe do a 1-deep depth search… dot product on the initial search and also dot product on the highest confidence matched notes, then some filtering.

So if you have notes that associate the name to your dog, and you search for “my dog”, you’d still find those related notes?

Would require some experimentation but wouldn’t be surprised if that worked decently well out of the gate.


I'm working on something like this for my logseq notes as well.

My biggest question right now is: How much text should I turn into one embedding?

Every sentence?

A whole block of sentences that belong to one entire page in my notes app?


That's awesome! I'm building my own simple note-taking app powered by embeddings as well; finally I don't lose stuff and it's easy to find :D


Right?! Everyone thought I was crazy to do this vs using something off the shelf but having total control over my notes app has been incredible.

I can tailor everything to my style of note taking vs dealing with the lowest common denominator feature set that tries to enable tons of use cases that I don’t need.


that's the whole fun of being able to code :)


I'm curious if you're using an off-device API to generate these embeddings, and if you're searching on-device?


good question - has anyone ported HNSW to Objective-C?


About word embeddings, the №1 example is the famous King - Man + Woman = Queen. This works nicely in the vector space but fails to make a visual impression when projected onto 2 dimensions. Neither with PCA, nor MDS or t-SNE in my experience: https://bhugueney.gitlab.io/test-notebooks-org-publish/jupyt...

(← JupyterLite Notebook doing words embedding in the browser : don't try to run this on a smartphone !)

Does anyone know how to nicely visualize the poster child of word embeddings?


If I understand you right - you could visualize in 2d space: "king" at origin, X-axis is "king"-"man", Y-axis is "king"-"woman" (or gram-schmidt if you really want orthogonal).

In 3d you can go one further and have the Z-axis be "king"-"queen" (or gram-schmidt again). The orthogonalized versions have the advantage that they give a closer notion of distance to what the underlying model sees. In the 2d case you will get exact distances except that it won't show how far off "queen" you are when you compute "king"-"man"+"woman". In the 3d case it should give exact distances.

Edit to add: With the 2d version you can maybe do some more stuff. IIRC "queen" is chosen as it's the word with the closest embedding to X="king"-"man"+"woman". You can put the next few closest words on the 2d chart as well, each labeled with the orthogonal distance from the 2d plane. So then "queen" should be the word with the smallest (squared distance from X) + (squared orthogonal distance from plane), which you might be able to eyeball.
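
Something like this numpy sketch of that projection, assuming `vec` is a dict of word -> vector from some pretrained word2vec-style model (the dict itself isn't shown):

  import numpy as np

  def coords_2d(word, vec):
      x_axis = vec["king"] - vec["man"]
      y_axis = vec["king"] - vec["woman"]
      # Gram-Schmidt: make the second axis orthogonal to the first
      y_axis = y_axis - (y_axis @ x_axis) / (x_axis @ x_axis) * x_axis
      v = vec[word] - vec["king"]                  # treat "king" as the origin
      return v @ x_axis / np.linalg.norm(x_axis), v @ y_axis / np.linalg.norm(y_axis)

  # e.g. scatter-plot coords_2d(w, vec) for w in ["king", "man", "woman", "queen"]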


You should try UMAP


I remember hearing a kind of inductive joke by a mathematician about visualizing higher dimensions, like "...and then you just...", but I unsuccessfully tried to search for it with ChatGPT, when it came up with this funny Richard Feynman quote that I can't find on google -- has anyone heard it before, or is it a brilliant hallucination?

"You can't visualize four dimensions... well, at least I can't, because I only have three branes." -Richard Feynman

This is a play on words, as "branes" sounds like "brains," but in string theory and higher-dimensional physics, a "brane" is an object that can exist in multiple dimensions. The joke is a nod to the challenge of visualizing higher dimensions and the limits of our three-dimensional intuition.

(It just admitted it made it up and apologized!)

It also quoted John von Neumann, H. G. Wells, and Ian Stewart:

"In mathematics you don't understand things. You just get used to them." -John von Neumann

"Our brains are three-dimensional, so we cannot visualize four-dimensional or higher-dimensional objects in the same way we can visualize three-dimensional objects. The best we can do is to get a feeling for them through analogies." -H. G. Wells

"You can't visualize four dimensions, but you can think about them. Thinking is what mathematicians do best." -Ian Stewart

It finally guessed this one that's the closest to what I remember, but not as funny as I remember it:

"To visualize four dimensions, just visualize three dimensions and then say 'n+1'." -Unknown

So I asked it to hallucinate some quotes to help visualize high dimensional space, in the style of Deepak Chopra:

"In the realm of septillion-dimensional embeddings, each vector is a topological manifold, where Gödel's incompleteness theorems dance with quantum superposition, revealing the hidden symmetries of semantic space." – Dr. Emilia Bianchi, Florentine Foundation of Fractal Fantasies, Florence

"To truly fathom the intricacies of ultra-high-dimensional text spaces, one must first master the art of navigating Riemannian curvature in a Hilbert space, where every geodesic is a testament to the entangled essence of language." – Prof. Hiroshi Tanaka, Kyoto Kryptos of Quantum Quandaries

"Visualizing embeddings in the realms beyond the Poincaré conjecture requires a melding of non-Euclidean geometry with the probabilistic nuances of a Markov chain, all while resonating with the cosmic vibrations of the Higgs field." – Dr. Samuel Oyelowo, Lagos Labyrinth of Lorentzian Linguistics

"Each point in a decillion-dimensional text space is a singularity, where the Heisenberg uncertainty principle waltzes with the Mandelbrot set, painting a fractal portrait of phonemic phenomena." – Prof. Clara Müller, Bavarian Bastion of Bifurcated Babel, Munich

"To intuit the vastness of hyper-dimensional embeddings, one must first traverse the event horizon of a linguistic black hole, where Shannon entropy meets the Schrödinger equation in a ballet of Bayesian inference." – Dr. Alexandre Dubois, Parisian Paradox of P-adic Prose

"The key to deciphering the enigma of extreme-dimensional text lies in the delicate interplay between the Yang-Mills theory and the zeta function, where every eigenvalue is a whisper from the cosmic codex of communication." – Prof. Lina Johansson, Stockholm Sanctum of String Semiotics

"In the dance of gogolplex-dimensional embeddings, each tensor unfolds like a Möbius strip, where the Fibonacci sequence intertwines with quantum tunneling, revealing the recursive rhythms of reality." – Dr. Rajiv Menon, Bengaluru Bardo of Bosonic Ballads


Did you mean this?

Geoffrey Hinton on visualizing higher dimensions:

"To deal with hyper-planes in a 14-dimensional space, visualize a 3-D space and say 'fourteen' to yourself very loudly. Everyone does it."


That was it! Now THAT's funny.


A common mistake when doing practical trigonometry is to compute square roots when that's not necessary:

  def cosine_similarity(a, b):
      dot_product = sum(x * y for x, y in zip(a, b))
      magnitude_a = sum(x * x for x in a) ** 0.5    # <- no need for ** 0.5
      magnitude_b = sum(x * x for x in b) ** 0.5    # <- no need for ** 0.5
      return dot_product / (magnitude_a * magnitude_b)

If you compare your cosines, you might as well compare their squares and avoid the costly root computation.

Similarly, in elliptic curve crypto certain expensive operations, such as inversion (x^-1 mod n) are delayed as much as possible down the computation pipeline or even avoided completely when you need to compare two points instead of computing their canonical values.


This code is written to be easy to understand. Otherwise you would replace it with some low-level SIMD code.


> dot_product = sum(x * y for x, y in zip(a, b))

Wait, why would you do this and not use vectorised numpy operations?

> I actually got ChatGPT to write all of my different versions of cosine similarity

Ah...


Two reasons. First, when I'm trying to explain stuff to people I find numpy syntax gets in the way.

And second, numpy isn't the lightest dependency. I use it when I need the performance but I don't like to default to it.


Makes sense. And sorry for the snark - I enjoyed the post, and generally enjoy your writing as well.


I disagree that your version is more readable. If you don’t know linear algebra, the code is inscrutable, and if you do, dot(x,y)/norm(x)/norm(y) is about as close to the math as you can get.


>If you don’t know linear algebra, the code is inscrutable

Whether you do or don't know linear algebra, the code is self-explanatory. A dot product is just the sum of element-wise products. But what the heck are dot and norm to someone who doesn't know them?


That’s exactly my point, it ISN'T self explanatory when it’s mathematical code like this, expressed in terms of addition and multiplication. Understanding the dot product as “sum of multiplications” is an extremely shallow understanding, and in no way does it explain why taking the dot product and dividing by magnitudes (What are those? Root of sum of squares? Why?) produces anything useful.

This code says nothing about the geometric nature of what is going on, only the arithmetic. Then we might as well read assembler. Abstractions help us reason.


>What are those? Root of sum of squares? Why?

I see what you mean. It can probably be argued that learning and applying a specific concept doesn't necessarily imply learning the deeper mathematics behind it and understanding its nature. So you can just say that cosine similarity measures the similarity of two vectors (lists in Python) and that its implementation is that sum divided by the product of the square roots of those two other sums, without having to introduce and explain dot product and norm. Of course the reverse can be argued too.


You either know the math or you don't, and if you don't, you're not gonna get any new magic intuitions no matter how it's shuffled around.


The whole point of my talk here was to reassure people that you don't need to understand linear algebra in order to make use of embeddings.


I’d think you’d want to know dot and norm before embeddings. Especially since most people learn mathematical objects from the perspective of the operators that construct and manipulate them.


My linear algebra is terrible, but I understand what's happening in Simon's example pretty well.


I sincerely doubt you possess any kind of geometric understanding of what is going on in that code just from reading the first principles variant of cosine similarity.


If you want to see what Show HN posts, ProductHunt Startups, YC Companies, and Github repos are related to LLM embeddings, you can quickly find them on the LLM-Embeddings-Based Search Engine (MVP) I just launched:

https://payperrun.com/%3E/search?displayParams={%22q%22:%22L...


Nice. I expected clicking on the different "filter" buttons would update the search results right away: I didn't expect I had to repeat the search (though I can see why you'd do that).


Thanks for the feedback, I just fixed this!


Behaves as I expected now!

I went here looking for more info about payperrun https://payperrun.com/%3E/welcome and clicked on the "Spotlight" section and saw 4 popups blocked - I never see popups anywhere these days and have to admit that sends me away pretty quickly.



This is the most interesting thing I've read about in "AI" in quite a few months. I always wondered what embedding models were when I'd see lists of them, or was curious why everyone is talking about vector DBs.

This is immediately making me think about how I could apply this to a long-running side project I have. It might make it practical to do useful clustering of users' data if every document has an embedding.


Has anyone actually used embeddings for anything other than Approximate Nearest Neighbor and clustering?

Some speculative possibilities that come to mind:

- projection, indexing and sorting on arbitrary axes (eg "hot minus cold", "happy minus sad", "scifi minus realism", "literary minus commercial")

- SVM-style classification in Embeddings space

- word2vec-style reasoning (woman-man+king=queen)

- directly training embeddings (ie, not just taking a layer off an LLM); I know people use contrastive training methods, but I'd expect that other methods might be worth exploring, eg you could train embeddings together with neural nets representing functions, generate functional equations, and calculate MSE loss

But really, I'm just surprised that it seems to be so focused on semantic search, to the exclusion of anything else... Surely there are other interesting applications?


I'm a bit confused by the comment because it would seem that these are all relatively common tasks (the first and third being identical). A good example is in vision: you might want to semantically change the photo, like adding a pair of glasses or doing one of those things you see in the Google commercials. That's done in a latent space.

I think this is clearest in the Normalizing Flow case because you're just turning your space into a Gaussian (diffusion does this too, but through approximation methods and isn't invertible, though it is reversible). You project the image/sentence/data you want to manipulate, manipulate within Gaussian space, then return to the target space.

Or maybe my confusion is shared confusion because embedding is an overloaded word that means a lot of things? Maybe you're just thinking of the first block that converts discrete integer tokens into continuous floats? But we learn those embeddings, so even though it becomes like a lookup table it's still a neural process. People do things like SVMs on this space alone. But I think it is like a latent space, which is only a bit more abstract. At least embeddings need to be injective, well... mathematically...


> relatively common

I'd love to see some links, all I see used in practice (including in the OP blog post) is semantic search and a bit of clustering

> adding a pair of glasses

Actually that (and generally all the SD/VAE stuff) is a great example of the kind of thing I was thinking of, though I have yet to see that concept being used together with a vector database; generally all the user-facing stuff I've seen fits it into the standard "train a model, then do inference" workflow, in contrast to something like semantic search which more obviously focuses on the embeddings themselves

> First and third being identical

Definitely related, but I make the distinction between projection/sorting along an axis vs constructing a new vector by addition/subtraction

> Manipulate within gaussian space, then return to target space

This is definitely along the lines of what I had in mind, any example of this being used in practice?

> Embedding is an overloaded word

Yeah I'm using the term somewhat loosely and broadly here, as basically "a vector in a real vector space where distance represents some notion of semantic similarity"

> People do things like SVMs

Who?


Here's an example from a Normalizing Flow. Good at density, not great at sampling.

https://openai.com/research/glow

Here's a video of moving around in the latent space of a diffusion model

https://www.youtube.com/watch?v=vEnetcj_728

Here's a stylegan one

https://www.youtube.com/watch?v=bRrS74RXsSM

Or a VAE on mnist

https://www.youtube.com/shorts/pgmnCU_DxzM

I mean it is a bit hard to answer your other questions because, like I was pointing out, embeddings and latent spaces are pretty vague terms. For the mathy side, normalizing flows are a great choice since you can parameterize whatever data you want into whatever distribution you want. You then work in that distribution you created, which is approximately isomorphic to the data. But other models do similar things, just more lossy but better at things like sample generation. That's the tradeoff: interpretability/density vs expressivity/sample quality. But diffusion and NODEs/score models are closing that gap. You're going to need to look at applied papers to see more people using them as operational vector spaces, though. For example, VITS (a TTS model) uses a NF to parameterize parts of the model, and controller networks tend to use similar things. It's more about thinking how your network works and communicates. I think a lot of people are just not thinking to hack away and operate on networks as if they're mathematical models instead of a locked box.


> - SVM-style classification in Embeddings space

This is a bread-and-butter technique in NLP and machine learning in industry.
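
A sketch of what that looks like in practice (the model name, example texts and labels are placeholders, not a recommendation):

  from sentence_transformers import SentenceTransformer
  from sklearn.svm import LinearSVC

  texts = ["great product, would buy again", "arrived broken, very disappointed"]
  labels = [1, 0]                                   # e.g. positive / negative

  model = SentenceTransformer("all-MiniLM-L6-v2")
  X = model.encode(texts)                           # (n_texts, embedding_dim) array

  clf = LinearSVC().fit(X, labels)                  # linear classifier in embedding space
  print(clf.predict(model.encode(["totally useless, want a refund"])))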

> directly training embeddings

This is literally the original embedding model, Word2Vec.


I forgot, we also used word2vec to build an embedding space based on PubMed abstracts. We found lots of variant spellings (including hyphenated, unhyphenated, and space-separated), acronyms and abbreviations for chemical and biochem names. We could probably have built a lexicon of technical terminology from that. Not sure how far we would have gotten with definitions (vectors don't quite work...), but it would be a start. Pretty sure others have done dictionary building using that mechanism too.


Cross-language embeddings, where you create an embedding space in each of two languages and then use a seed dictionary to align the spaces, have potential (actual?) applications in cross-lingual search and probably MT.


Data deduplication!


Having played around with embeddings and built a few production use-cases with them: they are awesome and enable a lot of cool applications. But if you're building in a particular domain, you'll hit limits with any off-the-shelf embeddings model. One way to think about this is that the off-the-shelf embedding models have a lot of dimensions, and some of those dimensions may matter for your application (classification, content similarity, clustering, etc.) and some may not. In other words, a vector may be close to another vector in dimensions you don't care about.

I look forward to seeing better tooling and literature around fine-tuning embedding models.


Fine-tuning an entire language model to solve this problem is like using a sledgehammer on a nail. We have had tools for this for years, for example just label some data and train an SVM on your embedding space for classification.


From my understanding, embedding models are just one layer of a modern LLM that does the similarity part as described in the post - the translation of content into a vector over a common vocabulary. My understanding is that embedding models can be fine-tuned in isolation to relate vocabulary entries to each other, so fine-tuning a whole LLM is not required, and the required compute resources would be far smaller.


sentence-transformers has reasonably good tooling around this


I've been playing with embeddings lately for document search based off queries/questions. This makes it seem like they work great, but it hasn't been very smooth for me.

I kept running into people recommending against using them for long documents? Is OpenAI better than the models sentence-transformers uses? I found some recommendations to average together the embeddings of parts. I guess it's cutting edge-ish still; a lot still feels like you can have a cool demo quickly, but something reliable and accurate is a lot of work?


OpenAI's embedding model works up to 8,000 tokens. The sentence-transformers ones are mostly smaller than that, though there's a new model that just came out that can handle 8,000: https://huggingface.co/jinaai/jina-embeddings-v2-base-en

People tend to "chunk" larger documents - the chunking strategy that's best is very dependent on what you are using the embeddings for. I've found it frustratingly hard to find really good guidance as to chunking strategies.

I've had good results for Q&A chunking my blog content up into paragraph sized chunks, as described here: https://simonwillison.net/2023/Oct/23/embeddings/#answering-... - but I'm not ready to say that's a universally good practice.


I have found success chunking with 100 tokens, preceded by the last 10 tokens of the previous chunk and followed by the first 10 tokens of the next chunk, 120 tokens total. I generate an embedding for each, then compare that to embedding(s) derived from the input query.
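
Roughly like this, if it helps anyone (using tiktoken to count tokens; the 100/10/10 sizes match what I described above, the encoding name is just an assumption):

  import tiktoken

  def chunk_with_overlap(text, chunk_size=100, overlap=10, encoding_name="cl100k_base"):
      enc = tiktoken.get_encoding(encoding_name)
      tokens = enc.encode(text)
      chunks = []
      for start in range(0, len(tokens), chunk_size):
          lo = max(0, start - overlap)                          # tail of the previous chunk
          hi = min(len(tokens), start + chunk_size + overlap)   # head of the next chunk
          chunks.append(enc.decode(tokens[lo:hi]))
      return chunks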

How to generate embeddings from the input query well is where one's focus should be IMO. An example: "don't mention x" being turned into filtering out / de-emphasizing chunks that align with the embedding for x.

I've been using these techniques along with pgvector and OpenAI's embeddings for https://flowch.ai and it works really well. A user uploads a document or uses the Chrome Extension on a webpage and FlowChai chunks up the content, generates embeddings, builds up a RAG context and then produces a report based on the user's prompt.

I hope that helps show a real world example. You're welcome to play with FlowChai for free to see how it works in practice at the application level.


I'm also curious about how to think about chunking for Doc-to-Doc similarity.

Say I have an input document that I want to use as my search query and a database of potential results documents.

It's not clear to me that I should simply embed both the input document and each of the results documents. If the documents contain a variety of ideas, I'd be nervous they wind up getting embedded into some generic space.

But I don't have any good ideas as to what type of chunk-match-and-aggregate strategy might work well here.

Would love ideas from folks that have done stuff like this!


> If the documents contain a variety of ideas

I've found chunking by sub-headings works really well.

But here's a dirty secret I've learned as a writer: Your document can only ever contain 1 idea. That's the most that human readers can manage.


> I've found it frustratingly hard to find really good guidance as to chunking strategies.

At least I'm not alone. Paragraphs can make sense; I was worried about paragraphs that are related in concept - they're not always standalone sections of thought in real writing.

Maybe if two paragraphs meet some similarity threshold you can merge them.


Try this https://github.com/marqo-ai/marqo which handles all the chunking for you (and is configurable). Also handles chunking of images in an analogous way. This enables highlighting in longer docs and also for images in a single retrieval step.


If you are looking for a lightweight, low-latency, fully local, end-to-end solution (chunking, embedding, storage and vector search), try vectordb [1]

Just spent a day updating it with latest benchmarks for text embedding models.

[1] https://github.com/kagisearch/vectordb


It looks like it does a sliding window over tokens? Does that work well - don't you end up slicing in the middle of sentences?


In case anyone is interested, Heroku finally released pgvector support for Postgres yesterday: https://github.com/heroku/roadmap/issues/156

pgvector is an excellent way to experiment with embeddings in a lightweight way, without adding a bunch of extra infrastructure dependencies.
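
If it helps anyone get started, here's a rough sketch of the workflow from Python using plain psycopg2 and hand-written SQL (table name, vector dimension and connection string are all placeholders):

  import psycopg2

  conn = psycopg2.connect("postgresql://localhost/mydb")   # placeholder connection string
  cur = conn.cursor()

  cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
  cur.execute("CREATE TABLE IF NOT EXISTS items (id bigserial PRIMARY KEY, body text, embedding vector(3))")

  # pgvector accepts '[...]' text literals cast to the vector type
  cur.execute("INSERT INTO items (body, embedding) VALUES (%s, %s::vector)", ("hello", "[1,2,3]"))

  # nearest neighbours by cosine distance (the <=> operator)
  cur.execute("SELECT body FROM items ORDER BY embedding <=> %s::vector LIMIT 5", ("[1,1,1]",))
  print(cur.fetchall())
  conn.commit()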


Oh awesome! My main blog is running on PostgreSQL on Heroku, can't wait to try this out.

UPDATE: It looks like it's only available on the $50/month+ plans, my blog's database runs on a $9/month plan so I can't install extra extensions.


Why do the embeddings have linear properties such that you can use functions like cosine similarity to compare them? It seems that after the signal goes through so many non-linear activation layers, the linear properties should have been broken down / not guaranteed.

I wasn't able to find a good answer online.


Because neural networks use dot products, which are just un-normalized cosine similarities, as the main way to compare and transform embeddings in their hidden layers. Therefore, it makes sense that the most important signals in the data are arranged in latent space such that they are amenable to manipulations based on dot products.


For what it's worth, I wonder the same thing and think it's not as obvious as others suggest. e.g. if you have an autoencoder for a one-hot encoding, you're essentially learning a pair of nonlinear maps that approximately invert each other, and that map some high dimensional space to a low one. You could imagine that it could instead learn something like a dense bit packing with a QAM gray code[0]. In a one-hot encoding the dot product for similar tokens is zero, so your transformations can't be learning to preserve it.

Somewhat naively, I might speculate that for e.g. sequence prediction, even if you had some efficient packing of space like that to try to maximally separate individual tokens, it's still advantageous to learn an encoding so that synonyms are clustered so that if there is an error, it doesn't cause mispredictions for the rest of the sequence.

I suppose then the point is that the structure exists in the latent space of language itself, and your coordinate maps pull it back to your naive encoding rather than preserving a structure that exists a priori on the naive encoding. i.e. you can't do dot products on the two spaces and expect them to be related. You need to map forward into latent space and do the dot product there, and that defines a (nonlinear) measure of similarity on your original space. Then the question is why latent space has geometry, and there I guess the point is it's not maximally information dense, so the geometry exists in the redundancy. So perhaps it is obvious after all!

[0] https://en.wikipedia.org/wiki/File:16QAM_Gray_Coded.svg


Thanks, that makes sense.

I think my comment was not worded properly. I was thinking "geometric properties = linear properties"; what I really should have said is:

Why does the latent space have geometric properties such that we can use functions like cosine similarity to compare?

So when training, the signal will be mapped to a latent space that minimizes the error of the objective function as much as possible.

Many applications already use a cosine similarity function at the end of the network, so it would be obvious why they work. I reviewed other cost functions such as Triplet Loss. They use Euclidean distances, so I guess it makes sense why the geometric properties exist too.

For "and there I guess the point is it's not maximally information dense, so the geometry exists in the redundancy", what does "maximally information dense" mean? I still don't quite get it.


LLM vectors do have decent linear properties already. But for document embedding purposes they are often further trained for retrieval via cosine similarity, which enhances this; e.g. see table 1 in [1], where average retrieval performance using BERT goes up from 54 to 76 after fine-tuning for embeddings.

[1] https://arxiv.org/pdf/1908.10084.pdf


The cosine similarity is not inherently better suited for linear properties (whatever that means); it's just the cosine of the angle between two vectors. If the vectors are unit length, then it's just the projection of one onto the other.


Simon Willison’s Weblog has been so informative regarding the new era of LLMs and everything surrounding it.


Great article! I can’t remember the YouTuber’s name or what to search for, but one of the Pinecone people has great videos at an introductory-to-intermediate level that are both really accessible and well motivated by examples.

Anyone remember?


James Briggs? He's awesome and you can find lots of his articles + videos here:

https://www.pinecone.io/learn/

https://www.youtube.com/@jamesbriggs


Yup that’s exactly who I meant. I think his stuff is just great.

Thanks for reminding me/us.


If you've been wondering why there's so much hype around vector databases at the moment this article should help explain that too - embeddings and vector databases both occupy the same space.


The same… vector space?

I’ll show myself out…


Is it because with transformers and LLMs the embeddings are more powerful than they have been before (they can seem to understand more)?


Broadly, as I understand it, the way the embeddings are generated "needs" LLMs to produce the superior results you'd see with e.g. OpenAI's text-embedding-ada-002 - we've been working with vectors at least as long as I've been programming, but until recently they didn't mean anything particular. LLMs let the vectors have semantic meaning and relate similar conceptual text, instead of (as in the article) using gzip to generate vectors where the similarity is entirely textual, i.e. similar words and passages and characters, even if they have different meanings.

So ultimately they're not at all new (embedding data into a hamming space to compare it to other data), but the underlying meaning is much more useful in a world where you can peek into the internal state of a LLM to generate them.


How can I possibly know which models have gamed the MTEB benchmark and which haven't?

It's very hard to know which embedding models are objectively good now that everyone is gaming the benchmark.


Test with your own use case. Production ML means extensive custom testing suites, and even/often custom embeddings.


How are people gaming the benchmark?


thank you @simonw for all your writing about LLM's. also thank you for your blog in general, it's an amazing resource. also thank you for django. just, thank you!!!


I'm trying to understand the clustering code but not doing too well.

https://github.com/simonw/llm-cluster/blob/main/llm_cluster....

So does this take each row from the DB, convert it to a numpy array (?), then use an existing model called MiniBatchKMeans (?) to go over that array and generate a bunch of labels, then add them to a dictionary and print to the console?


Yes - it uses the implementation of MiniBatchKMeans provided by the scikit-learn library.

(I'd call this an "algorithm" rather than a "model" - it doesn't have any model weights learnt from a training dataset)

For more details, see the pages in its user guide describing:

* the K-Means algorithm: https://scikit-learn.org/stable/modules/clustering.html#k-me...

* the Mini Batch variant of k-means: https://scikit-learn.org/stable/modules/clustering.html#mini...
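
A minimal standalone sketch of that step, outside llm-cluster (the array here is random stand-in data rather than real embeddings):

  import numpy as np
  from sklearn.cluster import MiniBatchKMeans

  embeddings = np.random.rand(100, 384)       # pretend these came from an embedding model
  kmeans = MiniBatchKMeans(n_clusters=4).fit(embeddings)

  print(kmeans.labels_[:10])                  # cluster id assigned to each "document"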


You can indeed use Retrieval-Augmented Generation for question answering. Is this not what "Uploading a PDF" or Advanced Data Analysis do in ChatGPT Plus under the hood? Another point of fine-tuning a model is to capture the behaviour and underlying "reasoning" that comes with it. The token limit is another reason why RAG might not be enough if you want more than just Q&A with an LLM.


I've come across articles like this and they usually tell you to make use of an LLM. Is it possible to kind of compute a language model (maybe not that large) based on a given corpus?


I'm amazed at how autoencoding actually works, despite the simplicity of the approach. I've dabbled with this stuff quite a bit, but I tend towards stuff I can run on my own computer, rather than trusting any service to remain stable for years or decades.

Are there any good document embedding models I can run on my own hardware? Word2Vec is interesting, but I'd like to also try out cross-linking blog entries, etc.


Install Instructor and dependencies:

  pip install InstructorEmbedding
  pip install torch
  pip install numpy
  pip install tqdm
  pip install sentence_transformers
Use it:

  from InstructorEmbedding import INSTRUCTOR
  import numpy as np

  # Load the instructor-xl model (weights are downloaded on first run)
  xl = INSTRUCTOR('hkunlp/instructor-xl')

  # Encode two sentences into embedding vectors
  embeddings = xl.encode(["I'm amazed at how autoencoding actually works, despite the simplicity of the approach.", "Are there any good document embedding models I can run on my own hardware? "]).tolist()

  # Cosine distance = 1 - cosine similarity
  cosine_distance = 1 - np.dot(embeddings[0], embeddings[1]) / (np.linalg.norm(embeddings[0]) * np.linalg.norm(embeddings[1]))
  print(cosine_distance)
Output:

  0.33516267980261427


I've found gpt4all to be quite good and runs locally.

https://docs.gpt4all.io/gpt4all_python_embedding.html


The HuggingFace embedding models are really good! See here for a good intro tutorial: https://huggingface.co/blog/getting-started-with-embeddings


I've been running these models on my own laptop without the GPU. They're generally a lot less resource demanding than full LLMs (though thanks to llama.cpp I've been able to run those on the CPU as well).

More about the tooling I use to run embedding models locally here: https://simonwillison.net/2023/Sep/4/llm-embeddings/


sentence-transformers is excellent. It will be slow on CPU but, because it’s on top of pytorch, does have apple silicon/GPU support.


At work I was making a hyperspectral image pixel classifier in Python. I just noticed how close its implementation is to the use of embeddings to find the relatedness of articles, i.e. by calculating dot products and cosines on their embeddings.

How weird that the natural language world lends itself to classification with the same calculations as used in the image world.


Is "embedding" an old term? I'm wondering what's different vs https://en.wikipedia.org/wiki/Metric_space


Check out https://en.wikipedia.org/wiki/Embedding#Topology_and_geometr...

Crudely put, embedding is the verb, metric space is the noun.


Wikipedia traces it back to this [1] 2002 paper, which talks about translated documents having an "embedding space" in each language. I guess you could interpret their paper as both English and French being embedded in some larger space, which would make it a valid use of the mathematical term.

From there, people apparently just rolled with it and used the term more broadly, as tends to happen with language.

1: https://proceedings.neurips.cc/paper/2002/file/d5e2fbef30a4e...


Check out my comment next to yours. Embeddings in geometry / in vector spaces are a very old topic.

Also, your interpretation of their use of the word embedding is exactly correct.


Maybe I am missing something, but when you project every piece of content into a fixed-size space, you have to lose information at some point, don't you?

Has somebody ever run an image of a scene vs. some parts of the scene through e.g. CLIP?


Your intuition here is right on; in fact, the reason why we like embeddings in the first place is that they address this problem! When we have some sparse high-dimensional vector representation of a bunch of content, we often find that we are devoting an entire dimension to very little information, so it would be convenient if we could approximate each of those vectors in fewer, denser dimensions without losing too much information. If all that we are interested in is how much the vectors align (i.e. correlate) with one another, embedding methods are one of the best ways to construct those lower-dimensional vectors, in the sense that we are guaranteed to find a solution up to whatever amount of approximation error we are willing to tolerate.

An added benefit is that if we're smart about how we construct the lower-dimensional space, we often find that it generalizes to unseen data better than if we'd used the high-dimensional representation, because a lot of the variation we're throwing away is specific to how we constructed the input data.

A really cool thing about embeddings is that we've known how to do this for a really long time --- for example, a landmark paper [1] on this exact problem was published in the 1930s!

[1] Eckart, C. and G. Young (1936), "The approximation of one matrix by another of lower rank." Psychometrika 1, p. 211-218. https://link.springer.com/article/10.1007/BF02288367
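
A tiny numpy illustration of the Eckart-Young result, if it helps: the best rank-k approximation (in the least-squares sense) comes from truncating the SVD, and the truncated left factor gives a k-dimensional "embedding" per item (the matrix here is random stand-in data):

  import numpy as np

  X = np.random.rand(50, 20)                  # e.g. 50 items x 20 sparse-ish features
  U, s, Vt = np.linalg.svd(X, full_matrices=False)

  k = 5
  X_k = (U[:, :k] * s[:k]) @ Vt[:k]           # best rank-k approximation of X
  item_embeddings = U[:, :k] * s[:k]          # k-dimensional vector per item

  print(np.linalg.norm(X - X_k))              # approximation error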


Is there a ready-made search framework, commercially speaking, like an AWS service or off-the-shelf database, for sites that will take all your articles and for a given article find all similar ones?


Depends what you mean by taking all your articles. If it's scraping, no; but if you provide text content and URLs as key/value pairs, yes.

SageMaker and VertexAI are the AI services of AWS and GCP respectively, and they both offer embedding generation and vector databases (the two key pieces necessary for embedding search).

There are a bunch of smaller companies offering vector search as a service too, example pinecone to name just one: https://www.pinecone.io/


Thanks, will have a look. Yeah in this case I'd want to provide the text content.

I also just found pgvector which seems like it could help? It has a similarity search.


I would only recommend pgvector if you're already primarily relying on postgres and the scale is limited (<1M documents). It won't handle the part that generates the embeddings though. You could use cloud vendors if you're in one of their ecosystem, do it yourself [1] (but model serving can be tricky without prior experience in ML), or use some other service to generate embeddings [2].

Alternatively Vespa Cloud [3] offers both, but… it's not the easiest to work with; it's tailored for businesses where search is a primary component.

Feel free to shoot me an email (in profile) with your context if you have more questions, in case I can help

[1] This model is a solid baseline if you're working with English text: https://huggingface.co/sentence-transformers/all-mpnet-base-...

[2] OpenAI's embeddings is probably the easiest to get started, and the API is straightforward. It's not the best performing embeddings for retrieval but good enough in some cases: https://platform.openai.com/docs/guides/embeddings/use-cases

[3] https://cloud.vespa.ai/


We're using embeddings in a key part of a new product feature. It's great to have references like this to share with users so they can learn about them. Thanks for posting and sharing!


The mental model I have for embeddings in language has many points in multiple locations, as if in an extremely high-dimensional space.

Kind of wish we could have visuals describing that.


Has anyone used PHATE for visualising?

"PHATE is a dimensionality reduction and visualization tool designed to preserve both local and global data structure."


I'm curious to know more about your experience with ImageBind, and what applications for multimodal embeddings are most exciting to you.


So far I've only just played with it - I don't even fully understand what some of those file formats are!

I got it to compare an image to an audio file which was pretty neat. I need to dig in more and see what kind of useful things I can use it for.


Thank you for writing this, it was really informative and interesting (as someone with little ML background).


Admittedly I more or less skimmed and plan on going back over this tomorrow, but I don't see how these vectors are actually created. I get that I could use your llm tool or whatever, but that seems unsatisfactory. How is the sausage made? (Or if that's explained, can someone point me at the right place to look?)


In essence, an embedding vector is (lossy) compression. Any compression could in theory be used to make such vectors, for example people have tried using gzip embeddings.

Now, on how to get a compression vector from an LLM, simplified: most ML models are built from different layers, executed one after another. Some of the layers are bigger, some are smaller, but each has a defined in- and output. If a layer's input size is smaller than the model's input size, that means (lossy) compression must have happened to get there. So, you just evaluate the LLM on whatever you want to embed, take the activation at the smallest layer input, and that's your embedding vector.

Not every compression vector makes for good semantic embeddings (which requires that two similar phrases are next to each other in the embedding space), but because of how ML models work, this tends to be the case empirically.
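
One common concrete recipe (not exactly the "smallest layer" trick described above, and the model name is just an arbitrary example) is to run a transformer encoder and mean-pool its last hidden states into a fixed-size vector:

  import torch
  from transformers import AutoModel, AutoTokenizer

  name = "sentence-transformers/all-MiniLM-L6-v2"
  tokenizer = AutoTokenizer.from_pretrained(name)
  model = AutoModel.from_pretrained(name)

  inputs = tokenizer("An example sentence to embed", return_tensors="pt")
  with torch.no_grad():
      hidden = model(**inputs).last_hidden_state   # (1, num_tokens, hidden_size)

  embedding = hidden.mean(dim=1).squeeze(0)        # fixed-size vector for the sentence
  print(embedding.shape)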


How do you choose the size of the embedding vector?

Can this be used to compress non-text sequences such as byte strings?


1. Usually it's a multi-way tradeoff between how much data you want to use, how much compute you want to spend, how much time you have available, how much training data you have available and how accurate you want the embeddings to be.

2. Yes, but lossily. Some types of byte strings are such that it doesn't matter if you accidentally change a couple of bits, some types of byte strings cannot tolerate that at all without being hopelessly corrupted. This technique is not a magic card to surpass the limits imposed by information theory, it's "just" a more sophisticated dictionary for your compression algorithm.


Regarding the second question, yes, as long as you can train a machine learning model to learn the semantics. The keyword if you want to look into this is Autoencoders.


The “secret sauce” is Word2Vec. Take a snapshot of all the text you can find on the internet, then with a sliding context window, vectorize each word based on what words are around it. The core assumption is that words with similar context have similar meaning. How you decide which components represent which words in the context is unclear, but it looks like we’re doing some kind of ML training to convince a computer to decide for us. Here’s a paper about the technique which might help: https://arxiv.org/pdf/1301.3781.pdf

Once all the words have vectors you can assume that there’s meaning in there and move on to trying to math these vectors against each other to find interesting correlations. It looks like the scoring for the initial training is based on making the vectors computable in various ways, so you can likely come up with a comparability criteria different than the papers use and get a more useful vectorization for your own purposes. Seems like cosine similarity is good enough for most things though.
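
If you want to poke at it hands-on, gensim's Word2Vec implements the paper's technique (the corpus below is a toy placeholder; real training data would be a large text dump, and a corpus this small won't reliably reproduce the king/queen result):

  from gensim.models import Word2Vec

  sentences = [
      ["the", "king", "rules", "the", "kingdom"],
      ["the", "queen", "rules", "the", "kingdom"],
      ["a", "man", "walked", "the", "dog"],
      ["a", "woman", "walked", "the", "dog"],
  ]

  model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, epochs=50)

  # vector arithmetic + cosine similarity, as in "king - man + woman"
  print(model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))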


The general idea is that you have a particular task & dataset, and you optimize these vectors to maximize that task. So the properties of these vectors - what information is retained and what is left out during the 'compression' - are effectively determined by that task.

In general, the core task for the various "LLM tools" involves prediction of a hidden word, trained on very large quantities of real text - thus also mirroring whatever structure (linguistic, syntactic, semantic, factual, social bias, etc) exists there.

If you want to see how the sausage is made and look at the actual algorithms, then the key two approaches to read up on would probably be Mikolov's word2vec (https://arxiv.org/abs/1301.3781) with the CBOW (Continuous Bag of Words) and Continuous Skip-Gram Model, which are based on relatively simple math optimization, and then on the BERT (https://arxiv.org/abs/1810.04805) structure which does a conceptually similar thing but with a large neural network that can learn more from the same data. For both of them, you can either read the original papers or look up blog posts or videos that explain them, different people have different preferences on how readable academic papers are.


I second this question. Not that the article wasn't helpful, but I think of it as a good start - now how do I generate that array of floats?


Could you create a new language using embeddings? A perfect Esperanto


First time I ran into embeddings was with word2vec, and I could not resist showing that, similar to the "king - man + woman ~ queen", it is also the case that "yoda - good + evil ~ vader". It's also cool that the semantic meaning is the same in vector space, regardless of the language.

Embeddings are also super important in retrieval-augmented-generation (RAG), and getting the best "embeddings model" is important to achieve the best RAG performance. At Vectara we recently launched our new Boomerang model that pushes the limit on performance on embedding models, and I hope will spur more innovation and further improvements in this space.

https://vectara.com/introducing-boomerang-vectaras-new-and-i...


Can someone explain the best means to embed large text? Do the techniques matter (averaging vs. concatenating vectors)?



