I don't understand the emphasis here on vertical scaling. Move a database to a bigger machine = more storage and faster querying. Not exactly rocket science. Horizontal scaling is the real challenge here, and the complexity of vector indexes makes it especially hard. Milvus and Vertex AI both offer horizontally scalable ANN search, along with parallel indexing. I appreciate the post, but this doesn't seem worthy of an announcement.
Completely true. You have to understand the economics behind this to see why their claim is hyperbole at best and flat-out misleading at worst. The fundamental problem of scalable vector search is that you are dealing with potentially huge dimensionality and huge datasets, which means memory consumption will be enormous even for modest (by today's standards) datasets. This problem has garnered lots of research attention, so making such a bold claim makes you wonder what Pinecone has under the hood that others don't.
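To put rough numbers on that (my own back-of-the-envelope figures, not anything from the announcement), here's what a flat, uncompressed index costs in RAM:

```python
# Rough memory footprint of a flat (uncompressed) vector index.
# Illustrative numbers: 100M vectors of 768-dim float32 embeddings.
n_vectors = 100_000_000
dim = 768
bytes_per_float = 4  # float32

raw_bytes = n_vectors * dim * bytes_per_float
print(f"raw vectors alone: {raw_bytes / 2**30:.0f} GiB")  # ~286 GiB

# Graph-based indexes like HNSW add per-vector link overhead on top of
# this, so the real working set is even larger. Quantization (e.g. PQ)
# trades recall for memory to pull the number back down.
```

Close to 300 GiB before you've stored a single graph edge or metadata field, which is why "just buy a bigger box" stops being an answer at billion scale.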
Pinecone is VC-backed and has taken in funding to the tune of $50M. They have to claim to be the "first" to solve these challenging technical problems; otherwise they'd have to explain that their "secret sauce" is not really ground-breaking but relies on a series of open-source components under the hood. VCs wouldn't want to be backing yet another donkey in the derby. The truth is that solutions like FAISS, ScaNN, Weaviate, Qdrant, Annoy, and co. are working on this problem at a much more fundamental level, while Pinecone and Google's Vertex AI Matching Engine work on it at the application level. If Pinecone's solution were truly groundbreaking, they'd publish it in a more scientifically rigorous way. So these claims are to be taken with a grain of salt for what they are: developer evangelism/marketing speak.
I led a prod project at a bank that uses FAISS. A huge amount of the work was about making the index practical in a real IT environment. For example, we had to build a sharding system to let it scale, but also to allow the index to be rebuilt with zero downtime. There were many other significant engineering steps required to get it deployed.
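Our sharding layer was custom, so this isn't our code, but FAISS does ship an in-process primitive that illustrates the basic scatter-gather idea (faiss.IndexShards; in production each shard would live in its own process or host):

```python
import faiss
import numpy as np

d = 128  # vector dimension (illustrative)

# Fan queries out across several sub-indexes and merge the per-shard
# top-k. threaded=True searches shards in parallel; successive_ids=True
# keeps ids globally unique across shards.
shards = faiss.IndexShards(d, True, True)
subs = []  # keep Python references alive so the shards aren't freed
for _ in range(4):
    sub = faiss.IndexFlatL2(d)
    sub.add(np.random.rand(100_000, d).astype("float32"))
    shards.add_shard(sub)
    subs.append(sub)

queries = np.random.rand(10, d).astype("float32")
distances, ids = shards.search(queries, 10)
```

Zero-downtime rebuilds are the part FAISS gives you nothing for; the usual approach is blue-green: build a new generation of shards alongside the old and flip traffic once it's ready.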
So I would say that if Pinecone can solve these problems, they don't need fundamental breakthrough performance vs. the open-source systems. On top of that, as every dev knows, there's a bunch of hygiene components and features that prod software wants: connectors, admin interfaces, utilities. $50M is probably a bit low to cover all of these plus the marketing, to be honest, but it will go a long, long way, and I guess there's Series B funding to get over the line if they don't sell out.
On the other hand, they must avoid over-committing to the indexing approaches of today, because if someone makes an algorithmic step forward and Pinecone doesn't or can't take advantage of it, they're exposed: the deployment-enabling features they provide are just a matter of engineering that others can replicate. Also, at the end of the day, I think vector DBs are going to be an important niche in the enterprise, not at the scale of data warehouses, lakes, or application DBs. I think fitting them into the enterprise IT puzzle is going to be very important to making them commercially successful and good VC investments.
This sounds like a perfect job for jina.ai -> sharding, redundancy, automatic up- and down-scaling, security features, and most importantly: the flexibility to use and switch between whatever (vector) databases you like.
https://jina.ai/
A bigger machine doesn't automatically mean higher performance. The code needs to scale with the increased number of cores, use a share-nothing (or share-very-little) approach to avoid contention, and use efficient data structures to exploit the increased memory.
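A toy sketch of the share-nothing idea (entirely illustrative, not any particular product's design): give each worker process its own private shard so cores never contend on shared state, then scatter-gather:

```python
import multiprocessing as mp
import os
import numpy as np

_shard = None  # per-process shard, populated once per worker

def _init_worker(n, dim):
    global _shard
    rng = np.random.default_rng(os.getpid())  # distinct data per worker
    _shard = rng.random((n, dim), dtype=np.float32)

def _search_shard(args):
    query, k = args
    dists = np.linalg.norm(_shard - query, axis=1)  # brute-force L2 scan
    top = np.argsort(dists)[:k]
    return [(float(dists[i]), int(i)) for i in top]

if __name__ == "__main__":
    dim, k = 64, 5
    query = np.random.rand(dim).astype(np.float32)
    # One process per core; a real system would route exactly one request
    # to each shard rather than rely on the pool's task scheduling.
    with mp.Pool(4, initializer=_init_worker, initargs=(100_000, dim)) as pool:
        partials = pool.map(_search_shard, [(query, k)] * 4, chunksize=1)
    # Merge the per-shard top-k into a global top-k.
    print(sorted(hit for part in partials for hit in part)[:k])
```

No locks anywhere on the hot path, which is exactly the point.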
Larger disk space does help effortlessly scale storage in single-node systems, but I agree with you that shared-nothing (and/or shared-something) is a necessary step for extracting maximum performance from a larger machine. When it comes to distributed architectures, shared-nothingness is important as well. Decoupling storage from compute and stateless from stateful helps minimize resource allocation for billion-scale vector storage, indexing, and search. Milvus 2.0 implements this type of architecture - here's a link to our VLDB 2022 paper, if you're interested: https://arxiv.org/abs/2206.13843
Just having "moar disk!" ≠ "scalability." Unlike compute, where you can run single-threaded on a shard-per-core [or per-hyperthread] basis, align NUMA memory to those cores, etc., there's no way to make your storage truly "shared-nothing."
At ScyllaDB we've put years of non-trivial effort into IO scheduling to optimize it for large amounts of storage. You also need to consider the type of workload, because optimizing for reads, writes, or mixed workloads are all different beasties.
I agree in the typical case, but they support concurrent add/delete in one of their index options. Handling consistency/contention while modifying whatever graph/tree/etc. structure they're using is probably nontrivial, and the resulting cache invalidations would also likely hurt QPS.
P.S. Great work on your site, by the way - it's a really inspiring project!
Seems to me you can do that in a way that ensures low contention between consumers by using a read-biased MRSW (multiple-reader, single-writer) lock. Such a construction isn't free, but it really shouldn't eat into your read performance all that much. You're adding hundreds of nanoseconds to your query time by acquiring and releasing a lock. Unless you're already serving millions of queries per second per thread, this is piss in the ocean.
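For the avoidance of doubt, here's a minimal toy version of the construction I mean (pure Python; a serious implementation would likely use an RCU- or epoch-based scheme to shave the read path further):

```python
import threading

class ReadBiasedRWLock:
    """Many readers or one writer; biased toward readers, so a steady
    stream of reads can starve the writer. Sketch only."""

    def __init__(self):
        self._lock = threading.Lock()
        self._no_readers = threading.Condition(self._lock)
        self._readers = 0

    def acquire_read(self):
        with self._lock:
            self._readers += 1

    def release_read(self):
        with self._lock:
            self._readers -= 1
            if self._readers == 0:
                self._no_readers.notify_all()

    def acquire_write(self):
        self._lock.acquire()
        while self._readers:          # wait() drops the lock while sleeping
            self._no_readers.wait()
        # writer now holds _lock exclusively; new readers block

    def release_write(self):
        self._lock.release()

# Usage around a shared ANN index (names here are illustrative):
rw = ReadBiasedRWLock()

def search(index, queries, k):
    rw.acquire_read()
    try:
        return index.search(queries, k)  # many readers run concurrently
    finally:
        rw.release_read()
```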
> With vertical scaling, pod capacities can be doubled for a live index with zero downtime. Pods are now available in different sizes — x1, x2, x4, and x8 — so you can start with the exact capacity you need and easily scale your index.
This is developer evangelism at its best, at the behest of VCs, to scale and "productionalize". There are reasons why this problem is fundamentally difficult, and coming out of the blue claiming to have found a 10x solution in a field that has attracted so much research interest is highly sus. I would love to see a study that actually exposes their methodology, replication by independent parties, and, more importantly, CODE.
Confused by their claim to be the 'first' vector database. These things have been around forever. For example, FLANN (not a DB server, but an example of such a library) dates from 2009.
I think the difference is the layer of abstraction, i.e. FLANN is just the underlying search functionality, whereas vector databases are fully managed solutions. Even so, Weaviate came out in 2018, so claiming to be the "first" vector database is flat-out wrong, since Pinecone wasn't founded until 2019.
The fact that Weaviate only recently started calling itself a vector database is completely irrelevant here. They had this type of vector data infrastructure before Pinecone did, and that's all that matters.
Example: I'm going to start a new company called Conifercone and do pretty much exactly what you do, but call it a "vector datastore" instead. Apparently I've now created the first ever vector datastore even though functionally I have done nothing novel.
Just to clarify: Milvus is much more than a wrapper around FAISS. Our vector search component, Knowhere (https://github.com/milvus-io/knowhere), utilizes FAISS and Annoy and will soon include ScaNN, DiskANN, and in-house vector indexes as well. Milvus uses Knowhere as its compute engine and implements a variety of database functions, such as horizontal scaling, caching, replication, failover, and object storage, on top of it. If you're interested, I recommend checking out our architecture page (https://milvus.io/docs/architecture_overview.md).
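To make the division of labor concrete, here's roughly what the user-facing side looks like (a sketch assuming the pymilvus 2.x client; check the docs for current signatures). The index_type you pick is what gets dispatched to Knowhere underneath:

```python
from pymilvus import (
    Collection, CollectionSchema, DataType, FieldSchema, connections,
)

connections.connect(host="localhost", port="19530")

fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=128),
]
collection = Collection("demo", CollectionSchema(fields))

# The index_type here is handed to Knowhere, the compute engine; the
# sharding, replication, failover, etc. happen in the layers above it.
collection.create_index(
    "embedding",
    {"index_type": "IVF_FLAT", "metric_type": "L2", "params": {"nlist": 1024}},
)
collection.load()

results = collection.search(
    data=[[0.0] * 128],           # one query vector (placeholder values)
    anns_field="embedding",
    param={"metric_type": "L2", "params": {"nprobe": 16}},
    limit=10,
)
```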
[EDIT]: Forgot to mention - Milvus development began in 2018, and it was open-sourced in 2019.
I’m not sure where the other commenter gets their confidence, but Pinecone is not wrapping any open source vector-search library. We offer three index types (in-memory, in-memory graph-based, hybrid memory + disk), and all are proprietary.
We do have articles about Faiss and HNSW and all sorts of other vector-search and NLP topics, so it’s possible that’s where the confusion comes from.
They used to publish some benchmarks on their site but seem to have removed them; you can still find them on archive.org[1]. I guess it is understandable, since vector search performance is pretty unpredictable and depends on a lot of factors. If their target market is people who want vector search without needing to read a bunch of papers first, benchmarks might be more confusing than they are helpful.
edit: While I do think it's understandable, it's not great for transparency. Even if they don't want to open-source their index, I would admire them for giving ann-benchmarks[2] an API key to publish some independent results.
Disclaimer: I work on vector search at a different company
If you really think that's enough to build a real product, go for it.
Even open-source companies (Elastic, Mongo, Scylla) have to build tons of infra around their core codebase in order to make it an actual cloud product.
Pinecone doesn't use Faiss or ScaNN. We love Faiss and even teach people how to use it[1]. There happens to be a sizable population of engineers who need more than what Faiss provides (live index updates and metadata filtering, for example) and can't be bothered, or aren't being paid, to customize and manage open-source libraries all day.
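For readers who haven't used it, that's the level of abstraction in question; something like the following (a sketch assuming the Python client from around the time of this thread; the key, environment, index name, and fields are all placeholders):

```python
import pinecone

pinecone.init(api_key="YOUR_API_KEY", environment="us-west1-gcp")
index = pinecone.Index("example-index")

# Live index update: upserts become searchable without an offline rebuild.
index.upsert(vectors=[
    ("doc-1", [0.1] * 768, {"genre": "news", "year": 2022}),
])

# Metadata filtering applied alongside the vector query.
matches = index.query(
    vector=[0.1] * 768,
    top_k=5,
    filter={"genre": {"$eq": "news"}},
    include_metadata=True,
)
```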
So you guys developed and implemented state-of-the-art neural-network vector search from scratch? In a year? And something better than libraries with tens of contributors and years of research behind them?
Most vector search research teams are a lot smaller than you suggest, and haven't been around that long (e.g. the FAISS paper was published in 2017).
From public info, you can see they have at least one researcher working there. It's believable to me that they could have some new innovations, especially since the product space they're focusing on is different from that of other teams working on vector search. State-of-the-art for a specific set of constraints is still state-of-the-art.
However, considering how much of their edu-marketing content is posted to HN, it would be great if they could share more details about the internals of their index with the community. One of the great things about vector search is how many techniques are open sourced or documented in papers :).
Disclaimer: I work on vector search at a different company
I actually built a similar solution supporting similar operations (including filtering by metadata) using open-source libraries. It took me about 2 weeks net.
I can see a clientele for such a database (people who want a turnkey solution), but honestly it looks like an attempt to use a dev-ops solution to address deeper issues with problem formulation, e.g.:
1. Is there really a need to search all items in the database? Could subsampling make simple similarity comparison feasible?
2. Do the embeddings really need to have that many dimensions? Could we reduce their dimensionality and fit them in RAM? (See the sketch after this list.)
3. Is the embedding accurate enough compared to pairwise comparison? Could we reformulate the problem to make the latter feasible?
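On point 2, dimensionality reduction is a few lines with off-the-shelf tools, e.g. FAISS's built-in PCA transform (illustrative shapes and sizes):

```python
import faiss
import numpy as np

d_in, d_out, n = 768, 128, 100_000
x = np.random.rand(n, d_in).astype("float32")  # stand-in embeddings

pca = faiss.PCAMatrix(d_in, d_out)
pca.train(x)                 # fit the projection on (a sample of) the data
x_small = pca.apply_py(x)    # 768 -> 128 dims: 6x less RAM per vector

index = faiss.IndexFlatL2(d_out)
index.add(x_small)           # query vectors must go through the same PCA
```

Whether the recall hit is acceptable is problem-dependent, which is exactly why it's a formulation question rather than a dev-ops one.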
I also could not find any explanation of their underlying algorithms, especially around metadata filtering, which FAISS doesn't solve either, nor any numbers on their accuracy. (Happy to hear otherwise.)
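For reference, the naive DIY approach when the library has no native filtering is to over-fetch and post-filter, which is easy to write but loses recall when the filter is selective; that's the hard part a vendor would need to explain:

```python
import faiss
import numpy as np

d, n, k = 64, 10_000, 5
xb = np.random.rand(n, d).astype("float32")
meta = [{"lang": "en" if i % 2 == 0 else "de"} for i in range(n)]  # toy metadata

index = faiss.IndexFlatL2(d)
index.add(xb)

query = np.random.rand(1, d).astype("float32")
_, ids = index.search(query, 10 * k)  # over-fetch to survive the filter
hits = [i for i in ids[0] if i != -1 and meta[i]["lang"] == "en"][:k]
```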