I have built multiple systems using vector search; one of them is demoed in a search engine for non-commercial content at http://teclis.com
Running vector search (also sometimes referred to as semantic search, or as part of a semantic search stack) is a trivial matter with open-source libraries like Faiss https://github.com/facebookresearch/faiss
It takes 5 minutes to set up. You can search a billion vectors on common hardware. For low-latency use cases (up to a couple of hundred milliseconds), it is highly unlikely that any cloud solution like this would be a better choice than something deployed on premise, because of the network overhead.
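To make the "5 minutes" claim concrete, here is a minimal sketch of that setup with Faiss on random vectors (the dimensionality and index type are arbitrary choices for illustration, not anything the Teclis demo necessarily uses):

```python
import faiss
import numpy as np

d = 128                                                  # vector dimensionality
xb = np.random.random((100_000, d)).astype('float32')   # "database" vectors
xq = np.random.random((5, d)).astype('float32')         # query vectors

index = faiss.IndexFlatL2(d)       # exact L2 search; swap in IVF/HNSW variants at larger scale
index.add(xb)                      # build the index
distances, ids = index.search(xq, 10)   # top-10 nearest neighbours per query
print(ids[0], distances[0])
```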
(worth noting is that there are about two dozen vector search libraries, all benchmarked at http://ann-benchmarks.com/ and most of them open-source)
A much more interesting (and harder) problem is creating good vectors to begin with. This refers to the process of converting a text or an image into a multidimensional vector, usually done by a machine learning model such as BERT (for text) or an ImageNet-trained image model (for images).
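As an illustration of that conversion step, one common way to produce text vectors is a pre-trained sentence embedding model. A sketch using sentence-transformers (my choice of library and model, not necessarily what any of the demos mentioned here use):

```python
from sentence_transformers import SentenceTransformer

# Any pre-trained sentence embedding model works here; this one is a popular default.
model = SentenceTransformer('all-MiniLM-L6-v2')

docs = ["Google launches a vector search service",
        "How to bake sourdough bread at home"]
doc_vectors = model.encode(docs)                 # shape (2, 384); ready to index with Faiss
query_vector = model.encode(["semantic search announcement"])
```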
Try entering a query like 'gpt3' or '2019' into the news search demo linked in Google's PR:
https://matchit.magellanic-clouds.com/
The results are nonsensical. Not because the vector search didn't do its job well, but because generated vectors were suboptimal to begin with. Having good vectors is 99% of the semantic search problem.
A nice demo of what semantic search can do is Google's Talk to Books https://books.google.com/talktobooks/
This area of research is fascinating. For those who want to play with this more, an interesting end-to-end (including both vector generation and search) open-source solution is Haystack https://github.com/deepset-ai/haystack
There are two huge things your 5-minute setup is missing which are technically very hard to tackle:
1. Incrementally updating the search space. This is not easy to do, and for larger datasets it becomes increasingly important not to just do the dumb thing of rebuilding the entire index on every update (see the sketch below this list).
2. Combining vector search and some database-like search in an efficient manner. I don't know if this Google post really solves that problem or if they just do the vector lookup followed by a parallelized linear scan, but this is still an open research/unsolved problem.
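On point 1, the easy case looks like the sketch below: a flat Faiss index wrapped in an ID map supports inserts and deletes without retraining. The hard part, which this sidesteps, is doing the same efficiently on trained/partitioned indexes (IVF, graph-based, etc.) at scale:

```python
import faiss
import numpy as np

d = 128
index = faiss.IndexIDMap(faiss.IndexFlatL2(d))   # flat index, with our own document IDs attached

vecs = np.random.random((1000, d)).astype('float32')
index.add_with_ids(vecs, np.arange(1000).astype('int64'))

# Incremental insert: just add more vectors (no retraining needed for a flat index)
new_vecs = np.random.random((10, d)).astype('float32')
index.add_with_ids(new_vecs, np.arange(1000, 1010).astype('int64'))

# Delete by ID
index.remove_ids(np.array([3, 7], dtype='int64'))
```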
Spot on! Both of those were motivating factors when building Weaviate (Open Source Vector Search Engine). We really wanted it to feel like a full database or search engine. You should be able to do anything you would do with Elasticsearch, etc. There should be no waiting time between creating an object and searching. Incremental Updates and Deletes supported, etc.
Correct, that would take more than 5 minutes, although it is still possible with Faiss (and not that hard, relatively speaking). In the Teclis demo I did indeed do your second point: combine results with a keyword search engine, and there are many simple solutions you can use out there, like Meilisearch, Sonic, etc. If you were to use an external API for vector search, you would still need to build keyword-based search separately (and then the combining/ranking logic), so you may be better off just building the entire stack anyway.
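For the combining/ranking logic, one simple option (an assumption on my part, not necessarily what Teclis does) is reciprocal rank fusion over the keyword and vector result lists:

```python
def reciprocal_rank_fusion(result_lists, k=60):
    """Merge several ranked lists of doc IDs into a single ranking.

    result_lists: e.g. [keyword_results, vector_results], each a ranked list of doc IDs
    k: damping constant; 60 is the usual default from the RRF literature
    """
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

keyword_results = ["doc3", "doc1", "doc7"]   # e.g. from Meilisearch/Sonic/BM25
vector_results  = ["doc1", "doc9", "doc3"]   # e.g. from Faiss
print(reciprocal_rank_fusion([keyword_results, vector_results]))
```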
Anyway, for me, the number one priority was latency and it is hard to beat on-premise search for that.
Even then, a vector search API is just one component you will need in your stack. You need to pick the right model, create vectors (GPU intensive), then possibly combine search results with keyword based search (say BM25) to improve accuracy etc. I am still waiting to see an end-to-end API doing all this.
> then possibly combine search results with keyword based search (say BM25) to improve accuracy etc.
>
> I am still waiting to see an end-to-end API doing all this
That’s kinda the idea of Weaviate. You might like the Wikipedia demo dataset that contains all this. You indeed need to run this demo on your own infra but the whole setup (from vector DB to ML models) is containerized https://github.com/semi-technologies/semantic-search-through...
Exactly right. Things like data freshness (live index updates), CRUD operations, metadata filtering, and horizontal scaling are all “extras” that don’t come with Faiss. Hence the need for solutions like Google Matching Engine and Pinecone.io.
And even when ANN is all that's needed, some people just want to make API calls to a live service and not worry about anything else.
Can you expand more or provide a concrete example for the second point? What kind of database-like searches are you thinking about for spatial data? Things like range-queries can already be (approximately) done. Or are you thinking about relational style queries on data associated with each point?
Yes exactly, relational style queries with each data point. Maybe you have some metadata about your images and maybe you need to join against another table to properly query them. But at the same time you want to only grab the first k nearest neighbors according to vector similarity.
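A crude version of that today is the "over-fetch then filter/join" pattern: pull more neighbours than you need from the ANN index and apply the relational predicate afterwards. A sketch with hypothetical metadata columns follows; note that it breaks down when the filter is very selective, which is part of why this is still an open problem:

```python
import faiss
import numpy as np
import pandas as pd

d = 64
vecs = np.random.random((4, d)).astype('float32')
index = faiss.IndexFlatL2(d)
index.add(vecs)

# metadata table keyed by the same row IDs used in the vector index (hypothetical schema)
meta = pd.DataFrame({"id": [0, 1, 2, 3],
                     "license": ["cc", "commercial", "cc", "cc"],
                     "year": [2019, 2021, 2020, 2018]})

def filtered_knn(query_vec, k, oversample=10):
    # over-fetch from the ANN index, then apply the relational predicate in ranked order
    _, ids = index.search(query_vec.reshape(1, -1).astype('float32'), k * oversample)
    keep = []
    for doc_id in ids[0]:
        if doc_id == -1:                       # Faiss pads with -1 when fewer results exist
            break
        row = meta.loc[meta["id"] == doc_id].iloc[0]
        if row["license"] == "cc" and row["year"] >= 2019:
            keep.append(int(doc_id))
        if len(keep) == k:
            break
    return keep

print(filtered_knn(np.random.random(d), k=2))
```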
I agree that having good vectors is important to start with. However, it is not very hard to make this work: you only need to fine-tune some of the CLIP models[1] to run it well.
Disclosure: I have built a vector search engine to prove this idea[2].
I just made a couple of searches with Teclis. I have to say, it's not bad. It's clearly not complete and I get several empty searches. But the content of the results is of higher quality than what I get with Google or DDG. Nice work!
Thanks. The index is tiny and it is just a proof of concept of what a single person can do with technologies available nowadays. I felt it is better for it to return zero results than bad results.
As the site says this demo is by no means meant as a replacement for Google, but rather to complement it. I would say Teclis is good for content discovery and learning new things outside the typical search engine filter bubble. A few examples of good queries are listed on the site.
Not the author, but at work we've had in the hundreds of millions. Faiss can certainly scale.
If you do have a tiny index and want to try Google's version of vector search (as an alternative to Faiss), you can easily run ScaNN locally [1] (linked in the article; that's the underlying tech). At small scale I got better performance with ScaNN.
This demo is only about a million vectors. The largest I had in Faiss was embeddings of the entire Wikipedia (on the order of ~30 million vectors). I know people running a few billion vectors in Faiss.
So one vector per article? Doesn’t this skew results? A short article with 0.9 relevance score would rank higher than a long article containing one paragraph with 1.0 relevance. Am I mistaken?
Also, BERT on cheap hardware? I thought that without a GPU, vectorizing millions of snippets or doing sub-second queries was basically out of the question.
CPU BERT inference is fast enough to embed 50 examples per second. Your large index is built offline; the query is embedded live, but it's just one at a time. Approximate similarity search complexity is logarithmic, so it's super fast on large collections.
It's about choosing the right Transformer model. There are several models that are smaller, with fewer parameters than bert-base, which give the same accuracy as bert-base and which you can run on a modern CPU in single-digit milliseconds, even with a single intra-op thread. See for example https://github.com/vespa-engine/sample-apps/blob/master/msma...
I compared BERT[1], distilbert[2], mpnet[3] and minilm[4] in the past. But the results I got "out of the box" for semantic search were not better than using fastText, which is orders of magnitude faster. BERT and distilbert are 400x slower than fastText, minilm 300x, and mpnet 700x. At least if you are using a CPU-only machine. USE, xlmroberta and elmo were even worse (5,000 - 18,000x slower).
I also love how fast and easy it is to train your own fastText model.
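For anyone curious what training your own fastText model involves, a minimal sketch (the corpus file name and hyper-parameters are placeholders):

```python
import fasttext

# corpus.txt: one plain-text document (or sentence) per line, from your own domain
model = fasttext.train_unsupervised('corpus.txt', model='skipgram', dim=100, epoch=5)
model.save_model('domain_vectors.bin')

print(model.get_word_vector('semantic')[:5])          # 100-dim word vector
print(model.get_nearest_neighbors('semantic', k=5))   # closest words in the trained space
```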
Vector models are nothing but representation learning and applying the model out-of-domain usually gives worse results than plain old BM25. See https://arxiv.org/abs/2104.08663
A concrete example is DPR, a state-of-the-art dense retriever model trained on Wikipedia for question answering: when you apply that model to MS MARCO passage ranking, it performs worse than plain BM25.
> A much more interesting (and harder) problem is creating good vectors to begin with.
Indeed, this is the hardest problem. Vector search shines when used in-domain using deep representation learning, for example bi-encoders on top of transformer models for text domain.
However, these models do not generalize well when used out of domain, as the representations change. Hence, in many cases, simple BM25 beats most dense vector models when used in a different domain. See https://arxiv.org/abs/2104.08663
Do you have a starting point for generating useful vectors for full text search? To simplify, I have a 350k document academic database, with docs of maybe 500-5000 words. I would like to be able to provide similarity results as a form of "concordance".
Do sentence, paragraph, or full doc vectors work the best? Do things work better creating vectors from sentence vectors for longer sections, or have you had better results making longer vectors directly? Do different vectorizing algorithms work better on different sized sections of text?
I have not had much luck finding discussions or write-ups on this type of strategizing about chunks and algorithms to apply, beyond people saying that yes, that's how it's done.
Depends on what you want your search results to be. Do you want a list of documents, sorted by relevancy? Or a list of sentences? Or paragraphs? Personally, I find paragraphs a good middle ground. However, if many of your documents contain very long paragraphs, it would still be tedious to go through your search results. So what might work better in this case would be to slide a window of, say, 1-3 sentences over each document. This way, you'll be able to find the most relevant 1-3 sentence snippet, both within one document, and within your entire corpus.
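A minimal sketch of that sliding-window idea (the naive regex sentence splitter is only for illustration; use a proper sentence splitter in practice):

```python
import re

def sliding_windows(document, window=3, stride=1):
    """Yield overlapping snippets of `window` sentences each."""
    sentences = re.split(r'(?<=[.!?])\s+', document.strip())
    for start in range(0, max(len(sentences) - window + 1, 1), stride):
        yield " ".join(sentences[start:start + window])

doc = "First sentence. Second sentence. Third sentence. Fourth sentence."
for snippet in sliding_windows(doc, window=3):
    print(snippet)   # each snippet gets its own vector and points back to its document
```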
That is definitely part of the thought, and I am tending to think that there needs to be more than one index, to be able to compare both sentence-level and larger sections on different tabs, since the comparisons are quite different in nature (avoiding hard lines, it is still the case that sentence comparison is closer to grammatical and linguistic dependency, whereas longer passages will tend to reflect deeper thematic relationships).
What I was attempting to ask was more on how to effectively compare longer sections of text. It sounds like you are saying it is fine to compare them directly. So, due to very, very irregular formatting between texts that will likely never change, we would probably be stuck using only sentences and 'documents' (an article, a chapter of a book, etc).
So my questions, from a technical perspective, are:
1) Do you find you get better results treating these longer texts as a series of words or a series of sentences? And should we be performing a double vectorization, where we vectorize all sentences and then vectorize the longer text as a series of sentence vectors, instead of using a word vectorization technique directly?
2) Assuming we have decided to vectorize both sentences and longer sections of text (regardless of whether we use sentence vectors for the latter), do we use the same or different vectorization techniques/algorithms for the two tasks, and do you have any recommendations on specific algorithms for either case?
I'm quite new to the field myself. Many more experienced people, including the OP, I believe, would suggest that you use sentence transformers for this task. I personally don't understand how those are trained or fine-tuned, and I have never done it myself so far. What I do know is that sentence transformer results are horrible if you use them out of the box on anything other than the domain in which they were trained.
There's also the question of compute and RAM necessary to generate the embeddings. In one of my own projects, vectorizing 40M text snippets with sentence transformers would have taken 30 days on my desktop. So all I have worked with so far is fastText. It's several hundred times faster than any BERT model, and it can be trained quite easily, from scratch, on your own domain corpus. Again, this may be an outdated technique. But it's the one thing where I have some practical experience. And it does work quite well for creating a semantic search engine that is both fast and does not require a GPU.
The problem with fastText is that it only creates embeddings for words. You can use it to generate embeddings for a whole sentence or paragraph - but what it will do internally in this case is to just generate embeddings for all the words contained in the text, and then to average them. That doesn't give you really good representations of what the snippet is about, because common words like prepositions are given just as much weight as the actual keywords in the sentence.
Some smart people, however, figured out that you can pair BM25 with fastText: With BM25, you essentially create a word count dictionary on your corpus, that tells you which words are rare, and therefore especially meaningful, and which are commonplace. Then, you let fastText generate embeddings for all the words in your snippet. But, instead of averaging them with equal weights, you use the BM25 dictionary as your weights. Words that are rare and special thus get greater influence on the vector than commonplace words.
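A sketch of that weighted-average idea, using plain IDF weights as a stand-in for full BM25 term weights (the exact weighting scheme in the approach described above may differ):

```python
import numpy as np

def weighted_sentence_vector(tokens, ft_model, idf, default_idf=1.0):
    """Average fastText word vectors, weighted by each word's rarity (IDF)."""
    vectors, weights = [], []
    for tok in tokens:
        vectors.append(ft_model.get_word_vector(tok))   # ft_model: a loaded fastText model
        weights.append(idf.get(tok, default_idf))       # idf: {word: log(N / doc_freq)} over your corpus
    if not vectors:
        return np.zeros(ft_model.get_dimension())
    return np.average(np.array(vectors), axis=0, weights=np.array(weights))
```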
FastText understands no context, and it does not understand confirmation or negation. All it will do is find you the snippets that are using either the same words, or words that are often used in place of your query words within your corpus. But I find that is already a big improvement on mere keywords search, or even BM25, because it will find snippets that use different words, but talk about the same concept.
Since fastText is computationally "cheap", you can afford to split your documents into several overlapping sets of snippets: whole paragraphs, 3-sentence, 2-sentence and 1-sentence windows, for instance. At query time, if two results overlap, you just display the one that ranks the highest.
Personally, I would imagine that doing document-level search wouldn't be very satisfying for the user. We're so used to Google finding us not only the page we're looking for, but also the position within the page where we can find the content that we're looking for. With a scientific article, it would be painful to have to scroll through the entire thing and skim it in order to know whether it actually is the answer to our query or not.
And with sentence-level, you'd be missing out on all context in which the sentence appears.
As the low hanging fruit, why not go with the units chosen by the authors of the articles, i.e. the paragraphs. Even if those vary wildly in length, it would be a starting point. If you then find that the results are too long, or too short, you can make adjustments. For snippets that are too short, you could "pull in" the next one. And for those that are too long, you could split them, perhaps even with overlap. I think for both the human end user and most NLP models, anything the length of a tweet, or around 200 characters, is about the sweet spot. If you can find a way to split your documents into such units, you'd probably do well, regardless of which technology you end up using.
You can also check out Weaviate. If you use Weaviate, you don't have to worry about creating embeddings at all. You just focus on how you split and structure your material. You could have, for instance, an index called "documents" and an index called "paragraphs". The former would contain things like publication date, name of the authors, etc.. And the latter would contain the text for each paragraph, along with the position in the document. Then, you can ask Weaviate to find you the paragraphs that are semantically closest to query XYZ. And to also tell you which articles they belong to.
You can also add negative "weights", i.e. search for paragraphs talking about "apple", but not in the context of "fruit" (i.e. if you're searching for Apple, the company).
These things work out of the box in Weaviate. Also, you can set up Weaviate using GloVe (basically the same as fastText), or you can set it up using Transformers. So you can try both approaches, without actually having to train or fine-tune the models yourself. With the transformers module, they also have a Q&A module, where it will actually search within your snippets and find you the substring which it thinks is the answer to your question.
I have successfully run Weaviate on a $50 DigitalOcean droplet for doing sub-second semantic queries on a corpus of 20+M text snippets. Only for getting the data into the system I had to use a more powerful server. But you can run the ingestion on a bigger server, and when it's done, you can just move things over to the smaller machine.
Thank you for the write up, since as far as my research has gone that is the best description of how to go about planning for vectorization I have seen. Undoubtedly experimentation on our corpus is required, but it helps to have an overview so we don't wildly run down the wrong paths early on.
Exactly. Training models, generating embeddings and building databases can all take days to run, and hundreds of dollars in server costs. It's painful to have done all that only to realize that one has gone down the wrong path. It pays to test your pipeline with a smaller batch first; most likely you will discover some issues in your process. This iterative cycle is much faster if you work with a smaller test set first and do not switch over to your main corpus prematurely.
I recommend readers take parent post with a grain of salt.
(1) Google's offering returns within <5ms, in my experience.
(2) the demo is for paragraphs, not short text. You're putting mismatched data into the input, of course it's not going to work. Try a paragraph as suggested.
Hmm... there is no web service that returns a response in <5ms, unless you are sitting at the very terminal of the hardware producing the output.
The demo featured in this PR takes about 800-1000ms total to produce search results. How much of that is the actual API is not known. Typically, an HTTPS request to an API in the cloud will cost you at least 50ms of network latency, more likely 100-200ms. If you are running vector search on premise you will obviously not have this overhead.
Text embeddings typically work for short text as well as paragraphs (paragraph embeddings are usually a mean/max over word embeddings anyway), simply because most commercial use cases demand handling of short text input: nobody is inputting a paragraph into a typical search box, and what use is a news search if you cannot type a single word like 'Biden' or 'gpt3' into it?
> Hmm.. There is no web service that returns response in <5ms, unless you are sitting at the very terminal of the hardware producing the output.
Typical Google Cloud Bigtable P(50) is <= 4ms to a GCE VM in the same region.
Source: I work on Google Cloud Bigtable.
EDIT: I should clarify that applies to point reads on an established connection using gRPC-over-HTTP/2 from e.g. Java/golang/C++ client, and doesn’t include cold connection set-up / uncached auth / TLS negotiation, but those are well into P(99.999) territory if you’re doing millions of QPS (which applies to most of our large customers).
Also pretty sure customers using hosted Redis / Memcached offerings like Google Cloud Memorystore or AWS Elasticache etc. get sub-1ms times for service-to-VM responses. Those aren’t over HTTP of course… so perhaps they don’t count as “web services”?
I think you are talking about latency over the equipment in the same datacenter/intranet here which, yes, I do not count as web services but are more akin to 'sitting at the terminal'. If I could access Google Cloud Bigtable over the web from my home Mac at <5ms, that would of course be the biggest thing since the invention of the web :)
It's a cloud offering, so the machines are located near each other. Using it with other cloud services is a fair comparison to running it in the same box on-prem.
The offering is similarity search, not a search engine. They offer image to image as another comparison point.
With the caveat of having to use GCP to host your server too, I can agree with you (although 5ms still sounds incredibly low, how many vectors was that?).
I was obviously talking about a general use case where a user considers using an API like this vs running Faiss, and their server can be anywhere (a use case that is more common to me personally).
We are only talking about backend performance, so I don't really think it's fair to include network latency in this metric; it can also vary hugely based on your network/location setup.
> For low-latency (up to couple of hundred milliseconds) use cases, it is highly unlikely that any cloud solution like this would be a better choice than something deployed on premise because of the *network overhead*.
>Try entering a query like 'gpt3' or '2019' into the news search demo
The results are bad because this is not the kind of input the engine was trained to handle.
You should copy-paste a part of an existing article. It will embed it into a multidimensional space and do a similarity search, the same way it was trained (by converting a full article paragraph to a vector).
If you give it just a word it can't convert it to a meaningful representation because the network wasn't trained to do this.
But you can train it differently and have it able to handle a few words. For example you can summarize every article to a few sentences and keywords and use a traditional keyword search.
One usual way to create good vector representations is to simultaneously encode two different spaces into the same vector space. You encode 'queries' and 'answers' such that they are close for the known (query, answer) pairs. This is what CLIP did, encoding both images and their corresponding descriptions into the same vector space.
You can download the precomputed CLIP embeddings of the LAION-400-MILLION OPEN DATASET on academictorrents.com
CLIP can do this for the problem of semantic image search because matching an image to its description is quite well defined. But quite often there is no unique, a priori meaningful way of matching a query to an answer, especially as the index gets big.
In the case of a basic query like 'gpt-3', the query is quite vague and it's not obvious along which direction you should do the ranking (do you mean you want articles generated by GPT-3? Articles mentioning GPT-3? A basic definition?). There is no a priori good answer, and that's where you can use additional context to refine the query. For example, Siri or an NLP bot could ask you to be more explicit about what you mean.
Or it can have multiple representations of your vector space and return the top-1 for each of those representations, hoping that you give it feedback by clicking the one that was most meaningful to you as the requester.
Thanks for the reference to Haystack. I didn't know it existed! I was looking into Hugging Face, which seems to let you build and train your own language model (still learning - but that's what I've learnt so far). I don't know how expensive these get (for example, if you have 100K lines?). Any thoughts on how this compares to Hugging Face, and any anecdotes on the time it would take to custom-train?
Haystack does not really compare to Hugging Face, as it directly uses it as a library; it's more of a layer on top.
Serving a search system for 100K lines (short document?) is trivial on a single CPU machine, so the costs will be low.
For training, if you stick to fine-tuning and your dataset is relatively small (<10M), you could use Colab with a free GPU. Training time really depends on your use case; the simplest classification may be handled in a couple of hours.
So how are you creating the embeddings for your search engine? GloVe? Sentence BERT? Are you training your own models? Are you employing any kind of normalization? There are so many variables to optimize on many levels. Which is, of course, what makes this whole area super exciting.
Really like the idea of Teclis, i.e., a non-commercial search engine. Is it correct that Teclis is HTTP-only (via port 1333) and TLS is not an option? (NB: I am not suggesting there is anything wrong with HTTP. I am simply curious whether TLS is available.)
There is nothing wrong with using HTTP if the data transferred is not sensitive like in this case for demo purposes (and if anything, it is also faster for the user).
People say google search is terrible these days, but I find the opposite.
I can vaguely describe in a sentence the gist of an article I've read, or an image, and the proper result will usually be in the first page.
Of course, it doesn't always work, sometimes there are "hash collisions" so to speak, but I don't think the old algorithm would have been more successful either, since if I knew the exact keywords to use, I wouldn't need to start with a vague description in the first place.
>I can vaguely describe in a sentence the gist of an article I've read, or an image, and the proper result will usually be in the first page.
For the specific context of "I can find something I've already found", yes, it's useful. I just wish there was a way to change that context to "discovery mode" where it uses a different algorithm that is oriented toward finding new information. I want to find sites in the spirit of those old-fashioned sites that are minimally styled with dense information. And not just Wikipedia or a few "trusted" sources like it used to be in earlier times, but a more well-rounded result set.
> sites in the spirit of those old-fashioned sites
I think the problem is that such sites are very difficult to find algorithmically, especially when it comes to their poor SEO. The reason they used to be so prevalent in the early 2000s search results is because that's mostly what the web was back then; a bunch of personal websites, blogs, etc.
To do that nowadays would require heavy (manual) curation, which obviously Google isn't interested in.
> I think the problem is that such sites are very difficult to find algorithmically
I don't even think it's that. The issue with older content isn't the search but the sorting. Google and to a lesser extent Bing (and thus DDG) have prioritized content claiming to be more recent relative to the time they were indexed in their display of results.
Showing more recent content is likely generally a safer bet for a search engine. More recent content is less likely to have suffered link rot and covers more recent developments in a subject.
Unfortunately Google et al don't really respect user preferences with respect to sorting. They have a financial incentive to show results they can turn into dollars (or cents).
I don't think it's entirely true that the current Google results are merely a matter of decent sites being hard to find. I'm pretty sure that two years ago, Google found a greater portion of "real content" pages than it finds today, and SEO was already huge then. And Google does a significant amount of human testing right now.
It is especially noticeable to me that, just in the last few months, Google has changed their algorithm so that some product will be the first item on even the most generic search.
> For the specific context of "I can find something I've already found",
I find it to be utterly terrible for that too... even when I have verbatim strings from the thing I'm looking for, it often simply doesn't show up, often because it rewrites the query into something about Kim Kardashian's butt and no amount of quotes or pluses will make it stop.
I'd like to join the chorus disappointed in Google search results.
It seems to do more correction for you, which is great if you're searching for common popular things. But any uncommon or precise query will often be misunderstood as something else.
Plenty of times, no matter how I reword my sentence or what sort of analogies I try to give it, I've had Google fail to give me something that I know exists and that I have to find some other way.
Note that this article is careful to never say that this "vector search" technology powers the classic Google Search. This sort of automated classification space is probably part of Google's general search algorithm, but it's probably a very small part. Youtube recommendations (based on description + thumbnail + potentially video content?) and Google Image Search are the two in-practice examples that it focuses on.
I've literally gone to Google and typed something very similar to "That guy in that thing with the dog" and the correct answer shows up as the first result. It's quite brilliant and magical how they do that.
But sometimes it's a total miss when I want something very specific, and it just shows me other things I didn't ask for.
I have more trouble finding exactly what I'm looking for using Google; for me it started going downhill when they removed the plus operator (and no, quotes don't work the same).
Also, Yandex is much better for reverse image search (similar images).
Yeah, I remember that in the past, maybe 5+ years ago, I used to phrase things in a certain way to please their algorithm. This is not needed any more. Sometimes it doesn't grasp concepts and returns false results, but this has become rarer as well.
I defended Google search without realizing my country was among the last countries getting all the stupid updates. I do not use Google search anymore. Why would I? I get top results that are clearly manipulated. Sites filled with ngrams and so on.
I think Google used to be better when it weighted incoming links higher than outgoing ones; e.g., these days, searching for a programming problem you get irrelevant "StackOverflow/GitHub clones". Cosine similarity obviously only considers similarity, which can be helpful in contextless scenarios, but the number of undetected duplicates in a specific context is absurd.
If you don't know what you're looking for, it works well. If you know exactly what you want, specific keywords etc., good luck because it's gone from the mildly condescending "Did you mean..." corrections to "Yeah I am going to ignore all that, I bet what you really want to find is some ads."
> I can vaguely describe in a sentence the gist of an article I've read, or an image, and the proper result will usually be in the first page.
I can never do that. What I remember is so far from the wording used or how Google identifies the image that it never comes up. I end up scrolling back through my history trying to do it from the page title or domain
I disagree about the quality of Google search, but I should note this has nothing to do with the utility of Google's vector search library, which is just one low-level part of the process of producing the final Google results, and I'd expect the technical quality here to be excellent by default.
Whether one likes or hates current Google search results, their qualities and the changes from early search processes are clearly intentional and don't relate to how well Google does raw indexing.
Dumb question: how does Weaviate know that "Scandinavian" is close to "Finnish"? The source doesn't contain "Scandinavian" at all.
If their vectors are close, then the "vectorization" is quite standard for any text, and also per language?
The models that create the vector embeddings are trained on either general or domain-specific knowledge. So, to oversimplify it a bit: the model has learned - based on the training data it was presented with - that "Scandinavian" has a relationship to "Finnish". Since the vector space is high-dimensional, you can think of each language concept having a distinct place in that space. In this case the concepts for "Scandinavian" and "Finnish" were close enough that you got a matching result.
To simplify it even more: the vectors do not represent the words but the meaning behind them. So the two sentences "I like wine" and "The fermented juice of grapes is my favorite beverage" have zero keywords overlapping, but are semantically identical. A good model would give them very similar vectors, even though a traditional search engine would find zero resemblance between them.
EDIT: Just realized I didn't answer the second part of your question. Yes, the models are language-specific, but there are also multilingual models that work across a large no. of different languages.
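The "I like wine" example is easy to verify with any off-the-shelf sentence embedding model; a sketch with sentence-transformers (my choice, not necessarily what a Weaviate module runs internally):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')
a = model.encode("I like wine", convert_to_tensor=True)
b = model.encode("The fermented juice of grapes is my favorite beverage", convert_to_tensor=True)
c = model.encode("The train to Helsinki leaves at noon", convert_to_tensor=True)

print(util.pytorch_cos_sim(a, b))   # relatively high despite zero keyword overlap
print(util.pytorch_cos_sim(a, c))   # much lower
```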
I agree, but at the same time now is the easiest it's ever been to create great vectors. Sentence-BERT [1] by Nils Reimers is a collection of pre-trained models specifically trained to create good vectors. You can use them out of the box with Weaviate. All you have to do is select your desired model [2] and your text (or images, etc.) will be translated into vectors at import time. As I mentioned in another comment, with Weaviate the goal is to make it as easy to use as any existing search engine or database while still providing you the benefits of deep learning & vector search.
It's great to see more and more talk of vector search and vector databases. We've been promoting this technology for over a year now and have several intro articles for anyone looking to learn more[1], and a generous free tier on our vector search service[2] for anyone looking to give vector search a shot.
As for Pinecone itself, what are the main selling points as you see them for a simple application (e.g. comparing trigram-vectorized sets of strings) when compared to a home-rolled solution using postgres with array types? Better performance, ease of indexing, etc.?
(1) Pinecone uses dense vectors, which can encode much more meaningful info, e.g. the actual 'semantic meaning' behind a sentence as we (people) would understand it, or the context in an image. Because of this, we can enable much richer, human-like interaction/search in your applications.
(2) Performance-wise, before joining Pinecone I spent a lot of time with other dense vector search tools like Faiss, and it isn't easy to get good or even reasonable accuracy and latency, particularly for large datasets. When I first used Pinecone, it took me maybe 10 minutes to figure everything out and start querying a reasonable dataset; search times were very fast and the accuracy incredible. Pinecone's tech is built by people who live and breathe vector search, and what they've built outperforms anything I can build, even if I spend months trying. I got better performance with Pinecone in 10 minutes.
(3) Everything is production-ready. No need to worry about deployment, security, maintenance, etc.; Pinecone deals with it, and you can even use the service for free for up to 1M vectors.
I pinged someone more technical from our team to chime in.
In the meantime I can say moving to the dense vector + ANN search combo turns regular searches into semantic searches, which means more relevant results.
If that's the case for you, then you can use Pinecone to go further and make those results fast (<100ms), fresh (CRUD + live index updates), and filtered (single-stage metadata filtering). All on a fully managed system that you can scale up/down with one API call.
I've been toying with making a deckbuilder for Magic: The Gathering and could see this being potentially useful for finding fun card combinations. Thanks!
That would be a fun use case for us to promote. Let me know when it's ready! The free plan supports as many as 1 million items, more than enough for all the MTG cards in existence. Plus you can add and filter by metadata, like card type and properties.
> Plus you can add and filter by metadata, like card type and properties.
I read through your docs and figure that will be part of the approach.
An idea I had was to find similar, or "next best", cards for replacement in popular decks or to achieve similar effects in order to bring down the cost of EDH, Modern, etc. formats. I'm just getting back into the hobby again, so having a tool like this would make my wife and wallet happy :)
I just want to chime in and say that the resources on your website look amazing. I spent 5 minutes poking around and it looks really high quality.
I'm dabbling in Postgres's full text search (ts_vector) for a small website, I know that is extremely simple compared to the offerings you provide, but your site has me quite interested in this space now.
Does Pinecone have any position on the status of document embeddings and whether they would be considered PII? One of the challenges of using a fully managed service is the headache of adding yet another data subprocessor and all of the legal and compliance questions that raises.
That depends on the document. We do not see the original document, only the embedding. You can argue that is sufficiently obfuscated to not count as PII. The good news is we are SOC2 compliant and GDPR-friendly and do a bunch of other stuff to help you meet security compliance requirements: https://www.pinecone.io/security/
No, I understand that. I guess my question is actually around your experience with "You can argue that is sufficiently obfuscated to not count as PII" and whether your customers are actually successful with this argument.
Those who need more assurance just look at our SOC2 compliance, or have us go through a security review, or opt for the dedicated-environment deployment option.
Let's say I have a content website with about 20k content pages. I want to automatically cluster the pages so that each page has related content linked. Right now I'm using a hacked-together tf-idf setup using sklearn and Python 2, and it just works. The downsides are that I have to compute everything offline whenever I add new content, and that it's one more thing to maintain/upgrade.
I'm wondering if anyone has a suggestion of a SaaS or another alternative for my use case? Thanks!
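Not a SaaS suggestion, but for reference, the offline approach described above really is only a few lines with scikit-learn, which is part of why it "just works" (the page contents here are placeholders):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

pages = ["text of page one ...", "text of page two ...", "text of page three ..."]

tfidf = TfidfVectorizer(stop_words='english', max_features=50_000)
matrix = tfidf.fit_transform(pages)       # (n_pages, n_terms), sparse
similarity = cosine_similarity(matrix)    # (n_pages, n_pages)

related_to_page_0 = similarity[0].argsort()[::-1][1:4]   # top 3 related pages, excluding itself
print(related_to_page_0)
```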
What if we had local vector search on our web browser history (the content as it was displayed)? That would be radical. I'm wondering why browser vendors don't scramble to create the personal vector database. It could be integrated through a browser extension to insert local results when doing regular web searches, or provide context for a speech based personal assistant. Having a neural net at hand could also prove useful in semantic filtering of webpages (hide or highlight content) and curating your news feeds.
This GitHub repo makes it pretty easy to create similar tech by first embedding any images you have using the released CLIP model from OpenAI and then creating a Faiss index over these embeddings for quick retrieval/decode. You can then do text->image and image->image semantic search.
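A hedged sketch of that pipeline, following OpenAI's published CLIP API plus Faiss (the linked project's own code may differ, and the image path is a placeholder):

```python
import clip     # pip install git+https://github.com/openai/CLIP.git
import faiss
import torch
from PIL import Image

model, preprocess = clip.load("ViT-B/32", device="cpu")

# 1) embed your images once and index them
image = preprocess(Image.open("photo.jpg")).unsqueeze(0)
with torch.no_grad():
    img_vec = model.encode_image(image).float().numpy()
faiss.normalize_L2(img_vec)                    # normalised vectors -> inner product = cosine
index = faiss.IndexFlatIP(img_vec.shape[1])
index.add(img_vec)

# 2) text -> image search
with torch.no_grad():
    txt_vec = model.encode_text(clip.tokenize(["a dog on the beach"])).float().numpy()
faiss.normalize_L2(txt_vec)
scores, ids = index.search(txt_vec, 5)
```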
More "Find _something_ fast with vector search". I was not successful in finding anything relevant. PageRank works because it _ranks_ pages by, among other features, number and quality of visitors.
E.g. searching for a Huxley quote gives me silly blog posts about saving money.
Query: "The function of the brain and nervous system is to protect us from being overwhelmed and confused by this mass of largely useless and irrelevant knowledge, by shutting out most of what we should otherwise perceive or remember at any moment, and leaving only that very small and special selection which is likely to be practically useful."
Answer: "How to trick your brain into saving money"
We are developing open-source vector search technology. https://github.com/qdrant/qdrant It is a neural search engine with extended filtering support that implements a custom modification of the HNSW algorithm for Approximate Nearest Neighbour search.
It allows applying search filters, including geolocation, without compromising on results. Developed entirely in Rust language. You can find some demos and documentation here https://qdrant.tech
We have built the Milvus vector database on top of ANN libraries like Faiss, Annoy, NMSLIB, etc.
We are aiming to create a cloud-scalable vector database, so Milvus sits at the crossroads of vector search and cloud databases. There are many interesting system design topics in the development of Milvus 2.0, and we will continue to share our experiences and thoughts on this topic.
It's not very good. I tried different pictures and the results are almost random.
A picture from a cartoon returns everything from logos to any type of drawing.
A picture of a battery returns cars and shops.
A picture of food worked as expected and I got more food pictures.
For similar ANN/vector search capabilities, https://vespa.ai/ is a great open-source solution. Elasticsearch may offer some form of ANN too, but I'd need to double-check.
Vespa also allows expressing hybrid sparse and dense retrieval (WAND for sparse, ANN via HNSW for dense). It's also easy to express multi-stage retrieval and ranking phases, since vector search alone does not achieve state-of-the-art ranking results; see https://blog.vespa.ai/pretrained-transformer-language-models...
There is a lot being done in vector search technology right now.
I was less fortunate when looking at vector storage.
I already looked at Pinecone and Weaviate, but they are all paid products.
Elasticsearch supports vectors (dense ones; they supported sparse ones at some point but removed support, I think), and has things like cosine similarity functions built in.
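For reference, the dense_vector route looks roughly like this on Elasticsearch 7.x (index and field names are made up, and the vector APIs have changed across versions, so check the docs for yours):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.indices.create(index="docs", body={
    "mappings": {"properties": {
        "text":   {"type": "text"},
        "vector": {"type": "dense_vector", "dims": 384},
    }}
})

query_vector = [0.1] * 384   # would come from your embedding model
resp = es.search(index="docs", body={
    "size": 10,
    "query": {"script_score": {
        "query": {"match_all": {}},
        "script": {
            "source": "cosineSimilarity(params.qv, 'vector') + 1.0",   # +1 keeps scores non-negative
            "params": {"qv": query_vector},
        },
    }},
})
```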
I'm curious about Gensim Doc2Vec Model. I used it 3 years ago and got decent results in vectorizing articles and then finding articles that were similar based on input text (half-written article for example).
More or less, but as always the devil is in the details. Here is a paper[1] that summarizes issues with naive approaches. Incidentally, the proposed solution in that paper (Hierarchical NSW) performed fairly well in the industry benchmarks.
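On the Doc2Vec question above, the Gensim (4.x API) version of that workflow is still only a few lines; a sketch with placeholder articles and hyper-parameters:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

articles = ["first article text ...", "second article text ...", "third article text ..."]
corpus = [TaggedDocument(words=a.lower().split(), tags=[i]) for i, a in enumerate(articles)]

model = Doc2Vec(corpus, vector_size=100, window=5, min_count=1, epochs=20)

# embed a half-written article and find the most similar indexed ones
vec = model.infer_vector("draft text of a half written article".split())
print(model.dv.most_similar([vec], topn=3))
```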
A k-d tree is a data structure. Whether you use it for exact nearest-neighbor queries or approximate ones is up to the algorithm used. K-d trees work well for a handful of dimensions; beyond that, they become quite expensive.
So how do you game this? "Googlebomb" this? I assume it's harder than keyword-based search? As a search engine, what efforts do I take to stop someone from gaming vector-based search engines?
Considering the number of possible keywords, and comparing this with feasible vector lengths, I wonder whether vector search isn't weaker when it comes to the long tail.