I built multiple systems using vector search, one of them demoed in a search engine for non-commercial content at http://teclis.com
Running vector search (also sometimes referred to as semantic search, or as part of a semantic search stack) is a trivial matter with open-source libraries like Faiss https://github.com/facebookresearch/faiss
It takes 5 minutes to set up. You can search a billion vectors on common hardware. For low-latency (up to a couple of hundred milliseconds) use cases, it is highly unlikely that any cloud solution like this would be a better choice than something deployed on premise, because of the network overhead.
(It is worth noting that there are about two dozen vector search libraries, most of them open source, all benchmarked at http://ann-benchmarks.com/)
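For reference, the "5 minutes to set up" claim is not much of an exaggeration. A minimal sketch with Faiss, where random vectors stand in for real embeddings and the dimensionality and index type are just placeholder choices:

```python
# Minimal Faiss setup: index a batch of vectors and run a k-NN query.
# The data here is random and only stands in for real embeddings.
import numpy as np
import faiss

d = 768                                              # embedding dimensionality (placeholder)
xb = np.random.rand(100_000, d).astype("float32")    # corpus vectors
xq = np.random.rand(5, d).astype("float32")          # query vectors

faiss.normalize_L2(xb)        # normalize so inner product == cosine similarity
faiss.normalize_L2(xq)

index = faiss.IndexFlatIP(d)  # exact search; swap for an IVF/HNSW index at larger scale
index.add(xb)

scores, ids = index.search(xq, 10)   # top-10 neighbors per query
print(ids[0], scores[0])
```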
A much more interesting (and harder) problem is creating good vectors to begin with. This refers to the process of converting a text or an image into a multidimensional vector, usually done by a machine learning model such as BERT (for text) or an ImageNet-trained model (for images).
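For the vector-creation step, one common choice (just one of many; the model name below is simply a popular pretrained checkpoint, not a recommendation) is a sentence-transformers bi-encoder:

```python
# Sketch: turning text into vectors with a pretrained bi-encoder.
# "all-MiniLM-L6-v2" is an assumed/example checkpoint; any sentence encoder works.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")   # small, CPU-friendly model
docs = [
    "OpenAI released GPT-3 in 2020.",
    "Faiss is a library for efficient similarity search.",
]
doc_vectors = model.encode(docs, normalize_embeddings=True)       # shape: (2, 384)
query_vector = model.encode("large language models", normalize_embeddings=True)
# These arrays can be fed straight into the Faiss index sketched above.
```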
Try entering a query like 'gpt3' or '2019' into the news search demo linked in Google's PR:
https://matchit.magellanic-clouds.com/
The results are nonsensical. Not because the vector search didn't do its job well, but because the generated vectors were suboptimal to begin with. Having good vectors is 99% of the semantic search problem.
A nice demo of what semantic search can do is Google's Talk to Books https://books.google.com/talktobooks/
This area of research is fascinating. For those who want to play with this more, an interesting end-to-end (including both vector generation and search) open-source solution is Haystack https://github.com/deepset-ai/haystack
There are two huge things your 5-minute setup is missing that are technically very hard to tackle:
1. Incrementally updating the search space. This is not easy to do, and for larger datasets it becomes increasingly important not to just do the dumb thing of retraining the entire index on every update.
2. Combining vector search and some database-like search in an efficient manner. I don't know if this Google post really solves that problem or if they just do the vector lookup followed by a parallelized linear scan, but this is still an open research/unsolved problem.
Spot on! Both of those were motivating factors when building Weaviate (Open Source Vector Search Engine). We really wanted it to feel like a full database or search engine. You should be able to do anything you would do with Elasticsearch, etc. There should be no waiting time between creating an object and searching. Incremental Updates and Deletes supported, etc.
Correct, that would take more than 5 minutes, although it is still possible to do with Faiss (and not that hard, relatively speaking). In the Teclis demo I indeed did your second point: combined results with a keyword search engine, and there are many simple solutions you can use out there, like Meilisearch, Sonic, etc. If you were to try using an external API for vector search, you would still need to build keyword-based search separately (and then the combining/ranking logic), so you may be better off just building the entire stack anyway.
Anyway, for me, the number one priority was latency and it is hard to beat on-premise search for that.
Even then, a vector search API is just one component you will need in your stack. You need to pick the right model, create vectors (GPU-intensive), then possibly combine search results with keyword-based search (say BM25) to improve accuracy, etc. I am still waiting to see an end-to-end API doing all this.
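As a sketch of that combining step (not how Google or any particular product does it), one simple approach is to normalize BM25 and vector scores and blend them. rank_bm25 here is just one convenient keyword scorer, and the document/query vectors are assumed to be normalized embeddings from whatever model you picked:

```python
# Naive hybrid ranking: fuse BM25 keyword scores with vector similarities.
# In practice you would build the BM25 index once, offline, not per query.
import numpy as np
from rank_bm25 import BM25Okapi

def hybrid_search(query, query_vector, docs, doc_vectors, alpha=0.5, k=10):
    bm25 = BM25Okapi([d.split() for d in docs])          # keyword side
    kw_scores = np.array(bm25.get_scores(query.split()))
    vec_scores = doc_vectors @ query_vector              # cosine sim (normalized vectors)

    def minmax(x):                                       # put both on a 0..1 scale
        return (x - x.min()) / (x.max() - x.min() + 1e-9)

    fused = alpha * minmax(kw_scores) + (1 - alpha) * minmax(vec_scores)
    return np.argsort(-fused)[:k]                        # indices of top-k documents
```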
> then possibly combine search results with keyword-based search (say BM25) to improve accuracy etc.
>
> I am still waiting to see an end-to-end API doing all this
That’s kinda the idea of Weaviate. You might like the Wikipedia demo dataset, which contains all this. You do indeed need to run this demo on your own infra, but the whole setup (from vector DB to ML models) is containerized: https://github.com/semi-technologies/semantic-search-through...
Exactly right. Things like data freshness (live index updates), CRUD operations, metadata filtering, and horizontal scaling are all “extras” that don’t come with Faiss. Hence the need for solutions like Google Matching Engine and Pinecone.io.
And even if you do just want ANN and nothing else, some people just want to make API calls to a live service and not worry about anything else.
Can you expand more or provide a concrete example for the second point? What kind of database-like searches are you thinking about for spatial data? Things like range-queries can already be (approximately) done. Or are you thinking about relational style queries on data associated with each point?
Yes exactly, relational style queries with each data point. Maybe you have some metadata about your images and maybe you need to join against another table to properly query them. But at the same time you want to only grab the first k nearest neighbors according to vector similarity.
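One naive way to approximate this (not claiming this is what Google's Matching Engine does) is to over-fetch from the ANN index and then filter or join against your metadata. A sketch, where `metadata` is a hypothetical id-to-attributes lookup and `index` is a Faiss-style index:

```python
# Post-filtering sketch: ANN lookup first, then a metadata "join".
# `index` is a Faiss index; `metadata` maps vector id -> dict of attributes.
def filtered_knn(index, query_vector, metadata, predicate, k=10, overfetch=10):
    # fetch more candidates than needed, since many may be filtered out
    scores, ids = index.search(query_vector.reshape(1, -1), k * overfetch)
    hits = []
    for score, i in zip(scores[0], ids[0]):
        if i != -1 and predicate(metadata[i]):   # Faiss returns -1 for missing neighbors
            hits.append((int(i), float(score)))
        if len(hits) == k:
            break
    return hits

# e.g. only images tagged as landscapes taken after 2019:
# filtered_knn(index, q, metadata, lambda row: row["tag"] == "landscape" and row["year"] > 2019)
```

If too few candidates survive the filter, you simply retry with a larger overfetch factor; pre-filtering inside the index itself is exactly the harder, still-open problem the parent comment describes.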
I agree that having good vectors is important to start with. However, it is not very hard to make this work; you only need to fine-tune some of the CLIP models [1] to get it running well.
Disclosure: I have built a vector search engine to prove this idea [2]
I just made a couple of searches with Teclis. I have to say, it's not bad. It's clearly not complete and I get several empty searches. But the content of the results is of higher quality than what I get with Google or DDG. Nice work!
Thanks. The index is tiny and it is just a proof of concept of what a single person can do with the technologies available nowadays. I felt it was better for it to return zero results than bad results.
As the site says this demo is by no means meant as a replacement for Google, but rather to complement it. I would say Teclis is good for content discovery and learning new things outside the typical search engine filter bubble. A few examples of good queries are listed on the site.
Not the author, but at work we've had indexes in the hundreds of millions of vectors. Faiss can certainly scale.
If you do have a tiny index and want to try Google's version of vector search (as an alternative to Faiss), you can easily run ScaNN locally [1] (linked in the article; that's the underlying tech). At small scale I had better performance with ScaNN.
This demo is only about a million vectors. The largest I had in Faiss was embeddings of the whole of Wikipedia (on the order of ~30 million vectors). I know people running a few billion vectors in Faiss.
So one vector per article? Doesn’t this skew results? A short article with 0.9 relevance score would rank higher than a long article containing one paragraph with 1.0 relevance. Am I mistaken?
Also, BERT on cheap hardware? I thought that without a GPU, vectorizing millions of snippets or doing sub-second queries was basically out of the question.
CPU BERT inference is fast enough to embed 50 examples per second. Your large index is built offline; the query is embedded live, but it's just one at a time. Approximate similarity search complexity is logarithmic, so it's super fast on large collections.
It's about choosing the right Transformer model. There are several models that are smaller, with fewer parameters than bert-base, which give the same accuracy as bert-base and which you can run on a modern CPU in single-digit milliseconds, even with a single intra-op thread. See for example https://github.com/vespa-engine/sample-apps/blob/master/msma...
I compared BERT[1], distilbert[2], mpnet[3] and minilm[4] in the past. But the results I got "out of the box" for semantic search were not better than using fastText, which is orders of magnitude faster. BERT and distilbert are 400x slower than fastText, minilm 300x, and mpnet 700x. At least if you are using a CPU-only machine. USE, xlmroberta and elmo were even worse (5,000 - 18,000x slower).
I also love how fast and easy it is to train your own fastText model.
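For reference, training and using an unsupervised fastText model really is only a few lines. A sketch, where `corpus.txt` (one document or sentence per line) and the hyperparameters are placeholders:

```python
# Sketch: train a fastText skipgram model on your own corpus, then embed text.
import fasttext

model = fasttext.train_unsupervised("corpus.txt", model="skipgram", dim=100)

word_vec = model.get_word_vector("semantic")                      # single-word embedding
sentence_vec = model.get_sentence_vector("semantic search query") # averaged word vectors

model.save_model("corpus.bin")   # reuse later without retraining
```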
Vector models are nothing but representation learning and applying the model out-of-domain usually gives worse results than plain old BM25. See https://arxiv.org/abs/2104.08663
A concrete example is DPR, a state-of-the-art dense retriever model trained on Wikipedia for question answering: when you apply that model to MS MARCO passage ranking, it performs worse than plain BM25.
> A much more interesting (and harder) problem is creating good vectors to begin with.
Indeed, this is the hardest problem. Vector search shines when used in-domain with deep representation learning, for example bi-encoders on top of transformer models for the text domain.
However, these models do not generalize well when used out of domain, as the representations change. Hence, in many cases, simple BM25 beats most of the dense vector models when they are used in a different domain. See https://arxiv.org/abs/2104.08663
Do you have a starting point for generating useful vectors for full text search? To simplify, I have a 350k document academic database, with docs of maybe 500-5000 words. I would like to be able to provide similarity results as a form of "concordance".
Do sentence, paragraph, or full-document vectors work best? For longer sections, do things work better when you build their vectors out of sentence vectors, or have you had better results embedding the longer text directly? Do different vectorizing algorithms work better on different-sized sections of text?
I have not had much luck finding discussion or write-ups on this type of strategizing about chunks and algorithms to apply, beyond people just saying that yes, that's how it's done.
Depends on what you want your search results to be. Do you want a list of documents, sorted by relevancy? Or a list of sentences? Or paragraphs? Personally, I find paragraphs a good middle ground. However, if many of your documents contain very long paragraphs, it would still be tedious to go through your search results. So what might work better in this case would be to slide a window of, say, 1-3 sentences over each document. This way, you'll be able to find the most relevant 1-3 sentence snippet, both within one document, and within your entire corpus.
That is definitely part of the thought, and I am tending towards thinking that there needs to be more than one index, so that sentence-level and larger-section comparisons can be shown on different tabs, since the comparisons are quite different in nature. Avoiding hard lines, it is still the case that sentence comparison is closer to grammatical and linguistic dependency, whereas longer passages will tend to reflect deeper thematic relationships.
What I was attempting to ask was more about how to effectively compare longer sections of text. It sounds like you are saying it is fine to compare them directly. Due to very, very irregular formatting between texts, which will likely never change, we would probably be stuck using only sentences and 'documents' (an article, a chapter of a book, etc.).
So my questions, from a technical perspective, are:
1) Do you find you get better results treating these longer texts as a series of words or as a series of sentences? Should we be performing a double vectorization, where we vectorize all sentences and then vectorize the longer text as a series of sentence vectors, instead of using a word vectorization technique directly?
2) Assuming we have decided to vectorize both sentences and longer sections of text, regardless of whether we use sentence vectors for the latter, do we use the same or different vectorization techniques/algorithms for the two tasks, and do you have any recommendations on specific algorithms for either case?
I'm quite new to the field myself. Many more experienced people, including the OP, I believe, would suggest that you use sentence transformers for this task. I personally don't understand how those are trained or fine-tuned, and I have never done it myself so far. What I do know is that sentence transformer results are horrible if you use them out of the box on anything other than the domain in which they were trained.
There's also the question of compute and RAM necessary to generate the embeddings. In one of my own projects, vectorizing 40M text snippets with sentence transformers would have taken 30 days on my desktop. So all I have worked with so far is fastText. It's several hundred times faster than any BERT model, and it can be trained quite easily, from scratch, on your own domain corpus. Again, this may be an outdated technique. But it's the one thing where I have some practical experience. And it does work quite well for creating a semantic search engine that is both fast and does not require a GPU.
The problem with fastText is that it only creates embeddings for words. You can use it to generate embeddings for a whole sentence or paragraph - but what it will do internally in this case is to just generate embeddings for all the words contained in the text, and then to average them. That doesn't give you really good representations of what the snippet is about, because common words like prepositions are given just as much weight as the actual keywords in the sentence.
Some smart people, however, figured out that you can pair BM25 with fastText: with BM25, you essentially build a word-frequency dictionary over your corpus, which tells you which words are rare (and therefore especially meaningful) and which are commonplace. Then you let fastText generate embeddings for all the words in your snippet. But instead of averaging them with equal weights, you use the BM25 dictionary as your weights. Words that are rare and special thus get a greater influence on the vector than commonplace words.
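Roughly, that weighting idea looks like the sketch below, using a plain BM25-style IDF term as the weight. Here `ft` stands for an already trained fastText model and `corpus` is a list of tokenized documents; both are assumptions for illustration:

```python
# Sketch: IDF-weighted average of fastText word vectors for a snippet.
import math
from collections import Counter

import numpy as np

def build_idf(corpus):
    n_docs = len(corpus)
    df = Counter(tok for doc in corpus for tok in set(doc))
    # BM25-style idf: rare words get large weights, common words small ones
    return {t: math.log((n_docs - d + 0.5) / (d + 0.5) + 1.0) for t, d in df.items()}

def weighted_embedding(tokens, ft, idf, default_idf=1.0):
    vecs = [ft.get_word_vector(tok) for tok in tokens]
    weights = [idf.get(tok, default_idf) for tok in tokens]
    return np.average(np.array(vecs), axis=0, weights=np.array(weights))
```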
FastText understands no context, and it does not understand confirmation or negation. All it will do is find you the snippets that are using either the same words, or words that are often used in place of your query words within your corpus. But I find that is already a big improvement on mere keywords search, or even BM25, because it will find snippets that use different words, but talk about the same concept.
Since fastText is computationally "cheap", you can afford to split your documents into several overlapping sets of snippets: Whole paragraphs, 3 sentence, 2 sentence, 1 sentence, for instance. At query time, if two results overlap, you just display the one that ranks the highest.
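A sketch of that overlapping-window splitting, assuming your documents are already split into sentences (the window sizes and field names are just illustrative choices):

```python
# Sketch: turn a list of sentences into overlapping windows of several sizes.
# Each snippet remembers where it came from so overlapping hits can be deduplicated.
def sliding_windows(sentences, sizes=(1, 2, 3)):
    snippets = []
    for size in sizes:
        for start in range(max(len(sentences) - size + 1, 1)):
            snippets.append({
                "text": " ".join(sentences[start:start + size]),
                "start": start,
                "end": start + size,
            })
    return snippets

# At query time: if two top-ranked snippets from the same document overlap
# (their [start, end) ranges intersect), keep only the higher-ranked one.
```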
Personally, I would imagine that doing document-level search wouldn't be very satisfying for the user. We're so used to Google finding us not only the page we're looking for, but also the position within the page where we can find the content that we're looking for. With a scientific article, it would be painful to have to scroll through the entire thing and skim it in order to know whether it actually answers our query or not.
And with sentence-level, you'd be missing out on all context in which the sentence appears.
As the low hanging fruit, why not go with the units chosen by the authors of the articles, i.e. the paragraphs. Even if those vary wildly in length, it would be a starting point. If you then find that the results are too long, or too short, you can make adjustments. For snippets that are too short, you could "pull in" the next one. And for those that are too long, you could split them, perhaps even with overlap. I think for both the human end user and most NLP models, anything the length of a tweet, or around 200 characters, is about the sweet spot. If you can find a way to split your documents into such units, you'd probably do well, regardless of which technology you end up using.
You can also check out Weaviate. If you use Weaviate, you don't have to worry about creating embeddings at all. You just focus on how you split and structure your material. You could have, for instance, an index called "documents" and an index called "paragraphs". The former would contain things like publication date, name of the authors, etc.. And the latter would contain the text for each paragraph, along with the position in the document. Then, you can ask Weaviate to find you the paragraphs that are semantically closest to query XYZ. And to also tell you which articles they belong to.
You can also add negative "weights", i.e. search for paragraphs talking about "apple", but not in the context of "fruit" (i.e. if you're searching for Apple, the company).
These things work out of the box in Weaviate. Also, you can set up Weaviate using GloVe (basically the same as fastText), or you can set it up using Transformers. So you can try both approaches, without actually having to train or fine-tune the models yourself. With the transformers module, they also have a Q&A module, where it will actually search within your snippets and find you the substring which it thinks is the answer to your question.
I have successfully run Weaviate on a $50 DigitalOcean droplet for doing sub-second semantic queries on a corpus of 20+M text snippets. Only for getting the data into the system I had to use a more powerful server. But you can run the ingestion on a bigger server, and when it's done, you can just move things over to the smaller machine.
Thank you for the write up, since as far as my research has gone that is the best description of how to go about planning for vectorization I have seen. Undoubtedly experimentation on our corpus is required, but it helps to have an overview so we don't wildly run down the wrong paths early on.
Exactly. Training models, generating embeddings and building databases can all take days to run, and hundreds of dollars in server costs. It’s painful to have done all that only to realize that one has gone down the wrong path. It pays to test your pipeline with a smaller batch first. Most likely you will discover some issues in your process. This iterative cycle is much faster if you work with a smaller test set first and do not switch over to your main corpus prematurely.
I recommend readers take parent post with a grain of salt.
(1) Google's offering returns with-in <5ms, in my experience.
(2) the demo is for paragraphs, not short text. You're putting mismatched data into the input; of course it's not going to work. Try a paragraph, as suggested.
Hmm... There is no web service that returns a response in <5ms, unless you are sitting at the very terminal of the hardware producing the output.
The demo featured in this PR takes about 800-1000ms total to produce search results. How much of that is the actual API is not known. Typically, an HTTPS request to an API in the cloud will cost you at least 50ms of network latency, more likely 100-200ms. If you are running vector search on premise, you will obviously not have this overhead.
Text embeddings typically need to work for short text as well as paragraphs (paragraph embeddings are usually the mean/max of word embeddings anyway), simply because most commercial use cases demand handling of short text input: nobody inputs a paragraph into a typical search box, and what use is a news search if you cannot type a single word like 'Biden' or 'gpt3' into it?
> Hmm... There is no web service that returns a response in <5ms, unless you are sitting at the very terminal of the hardware producing the output.
Typical Google Cloud Bigtable P(50) is <= 4ms to a GCE VM in the same region.
Source: I work on Google Cloud Bigtable.
EDIT: I should clarify that applies to point reads on an established connection using gRPC-over-HTTP/2 from e.g. Java/golang/C++ client, and doesn’t include cold connection set-up / uncached auth / TLS negotiation, but those are well into P(99.999) territory if you’re doing millions of QPS (which applies to most of our large customers).
Also pretty sure customers using hosted Redis / Memcached offerings like Google Cloud Memorystore or AWS Elasticache etc. get sub-1ms times for service-to-VM responses. Those aren’t over HTTP of course… so perhaps they don’t count as “web services”?
I think you are talking about latency over the equipment in the same datacenter/intranet here which, yes, I do not count as web services but are more akin to 'sitting at the terminal'. If I could access Google Cloud Bigtable over the web from my home Mac at <5ms, that would of course be the biggest thing since the invention of the web :)
It's a cloud offering, so the machines are located near each other. Using it with other cloud services is a fair comparison to running it in the same box on-prem.
The offering is similarity search, not a search engine. They offer image to image as another comparison point.
With the caveat of having to use GCP to host your server too, I can agree with you (although 5ms still sounds incredibly low, how many vectors was that?).
I was obviously talking about a general use case where a user considers using an API like this vs running Faiss, and their server can be anywhere (a use case that is more common to me personally).
We are only talking about backend performance, so I don't really think it's fair to include network latency in this metric; it can also vary hugely based on your network/location setup.
> For low-latency (up to a couple of hundred milliseconds) use cases, it is highly unlikely that any cloud solution like this would be a better choice than something deployed on premise because of the *network overhead*.
> Try entering a query like 'gpt3' or '2019' into the news search demo
The results are bad because this is not the kind of input the engine was trained to handle.
You should copy-paste part of an existing article. It will embed it into a multidimensional space and do a similarity search, the same way it was trained (by converting a full article paragraph to a vector).
If you give it just a word it can't convert it to a meaningful representation because the network wasn't trained to do this.
But you can train it differently and have it able to handle a few words. For example you can summarize every article to a few sentences and keywords and use a traditional keyword search.
One usual way to create good vector representations is to simultaneously encode two different spaces into the same vector space. You encode 'queries' and 'answers' such that they are close for the known (query, answer) pairs. This is what CLIP did, encoding both images and their corresponding descriptions into the same vector space.
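As a sketch of that dual-encoder idea, the CLIP checkpoint exposed through sentence-transformers (just one of the available wrappers; "cat.jpg" and the captions are placeholders) can encode both modalities into the same space:

```python
# Sketch: encode images and text into the same vector space with CLIP,
# so a text query can be matched against image embeddings (and vice versa).
from PIL import Image
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("clip-ViT-B-32")

img_emb = model.encode(Image.open("cat.jpg"))                    # image -> vector
txt_emb = model.encode(["a photo of a cat", "a city skyline"])   # text -> vectors

print(util.cos_sim(img_emb, txt_emb))   # the matching caption should score highest
```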
You can download the precomputed CLIP embeddings of the LAION-400M open dataset on academictorrents.com
CLIP can do this for the problem of semantic image search because the problem of matching an image to its description is quite well defined. But quite often there is no unique, a priori meaningful way of matching a query to an answer, especially as the index gets big.
In the case of a basic query like 'gpt-3', the query is quite vague and it's not obvious along which direction you should rank. (Do you mean you want articles generated by GPT-3? Articles containing 'gpt-3'? A basic definition?) There is no a priori good answer, and that's where you can use additional context to refine the query. For example, Siri or an NLP bot could ask you to be more explicit about what you mean.
Or it can keep multiple representations of your vector space and return the top-1 for each of those representations, hoping that you give it feedback by clicking the one that was most meaningful to you as the requester.
Thanks for the reference to Haystack. I didn't know it existed! I was looking into Hugging Face, which seems to let you build your own language model and train it (still learning, but that's what I've learned so far). I don't know how expensive these get (for example, if you have 100K lines?). Any thoughts on how this compares to Hugging Face, and any anecdotes on the time it would take to custom-train?
Haystack does not really compare to Hugging Face, since it uses Hugging Face directly as a library; it's more of a layer on top.
Serving a search system for 100K lines (short documents?) is trivial on a single CPU machine, so the costs will be low.
For training, if you stick to fine-tuning and your dataset is relatively small (<10M), you could use Colab with a free GPU. Training time really depends on your use case; the simplest classification may be handled in a couple of hours.
So how are you creating the embeddings for your search engine? GloVe? Sentence BERT? Are you training your own models? Are you employing any kind of normalization? There are so many variables to optimize on many levels. Which is, of course, what makes this whole area super exciting.
Really like the idea of Teclis, i.e., a non-commercial search engine. Is it correct that Teclis is HTTP-only (via port 1333) and TLS is not an option? (NB: I am not suggesting there is anything wrong with HTTP. I am simply curious whether TLS is available.)
There is nothing wrong with using HTTP if the data transferred is not sensitive, as in this case for demo purposes (and if anything, it is also faster for the user).