Vector search just got up to 10x faster and vertically scalable (pinecone.io)
92 points by gk1 on Aug 16, 2022 | 49 comments



I don't understand the emphasis here on vertical scaling. Move a database to a bigger machine = more storage and faster querying. Not exactly rocket science. Horizontal scaling is the real challenge here, and the complexity of vector indexes makes it especially challenging. Milvus and Vertex AI both have horizontally scalable ANN search and the ability to do parallel indexing as well. I appreciate the post, but this doesn't seem worthy of an announcement.


Completely true. You have to understand the economics behind this to see why their claim is hyperbole at best and flat-out misleading at worst. The fundamental issue with scalable vector search is that you are dealing with potentially huge dimensionality and huge datasets, which means memory consumption will be enormous even for modest (by today's standards) datasets. This problem has garnered lots of research attention, so making such a bold claim makes you wonder what Pinecone has under the hood that others don't.

Pinecone is VC-backed and has taken in funding to the tune of $50M. They have to claim to be the "first" to solve these challenging technical problems, otherwise they'd have to explain that their "secret sauce" is not really ground-breaking but relies on a series of open-source components under the hood. VCs wouldn't want to be backing yet another donkey in the derby. The truth is that solutions like FAISS, ScaNN, Weaviate, Qdrant, Annoy and co. are working on this problem at a much more fundamental level, while Pinecone and Google Vertex AI Matching Engine are working on it at an application level. If Pinecone's solution were truly groundbreaking, they'd publish it in a more scientifically rigorous way. So these claims are to be taken with a grain of salt for what they are: developer evangelism/marketing speak.


There are two dimensions here:

* fundamental performance

* practical utilization

I led a production project that uses FAISS at a bank. A huge amount of the work was about making the index practical in a real IT environment. For example, we had to build a sharding system to allow it to scale, but also to allow it to be rebuilt with zero downtime. There were many other significant engineering steps required to get it deployed.
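
For concreteness, here's a very rough Python sketch of those two ideas - fanning a query out across FAISS shards and merging the top-k, plus an atomic swap so a rebuilt index can replace the live one with zero downtime. The names and structure are purely illustrative, not what we actually shipped:

    import heapq
    import threading
    import faiss  # assumes faiss-cpu is installed; shards are plain FAISS indexes

    class ShardedIndex:
        def __init__(self, shards):
            self._shards = shards            # list of independent FAISS indexes
            self._lock = threading.Lock()

        def search(self, query, k):
            # Fan the (1, d) query out to every shard, then merge the per-shard top-k.
            # A real system would also map shard-local ids back to global ids.
            candidates = []
            for shard in self._shards:
                dists, ids = shard.search(query, k)
                candidates.extend(zip(dists[0], ids[0]))
            return heapq.nsmallest(k, candidates)

        def swap(self, new_shards):
            # The rebuild happens off to the side; the cut-over is a single reference
            # assignment, so readers never see a half-built index.
            with self._lock:
                self._shards = new_shards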

So I would say that if Pinecone can solve these problems, they don't need to have fundamental breakthrough performance vs the open-source systems. On top of that, as every dev knows, there are a bunch of hygiene components and features that prod software wants - connectors, admin interfaces, utilities. $50M is probably a bit low to cover all of these plus the marketing, to be honest - but it will go a long, long way, and I guess there's Series B funding to get over the line if they don't sell out.

On the other hand, they must avoid over-committing to the indexing approaches of today: if someone makes an algorithmic step forward and Pinecone doesn't / can't take advantage of it, then the deployment-enabling features they provide are just a matter of engineering for others to replicate. Also, at the end of the day I think vector DBs are going to be an important niche in the enterprise, not something at the scale of data warehouses, lakes, or application DBs. I think that fitting them into the enterprise IT landscape is going to be very important in making them commercially successful and good VC investments.


This sounds like a perfect job for jina.ai -> sharding, redundancy, automatic up- and down-scaling, security features and, most importantly, the flexibility to use and switch to whatever (vector) database you like: https://jina.ai/


On the topic of publishing, these two papers from the Milvus community may be of interest to some folks.

SIGMOD'21 - https://www.cs.purdue.edu/homes/csjgwang/pubs/SIGMOD21_Milvu... This paper talks about the vector database vertical (compute core and user-facing API)

VLDB'22 - https://arxiv.org/pdf/2206.13843.pdf This paper discusses the development of a cloud-native vector database.

Reading through them should help folks understand where the novelty and difficulty in developing a full-fledged vector database come from.

Disclaimer: I'm a member of the Milvus community.


A bigger machine doesn't automatically mean higher performance. The code needs to scale with the increased number of cores, use a shared-nothing (or shared-very-little) approach to avoid contention, and use efficient data structures to take advantage of the increased memory.


Larger disk space does help effortlessly scale storage in single-node systems, but I agree with you that shared-nothing (and/or shared-something) is a necessary step for extracting maximum performance from a larger machine. When it comes to distributed architectures, shared-nothingness is important as well. Decoupling storage from compute and stateless from stateful helps minimize resource allocation when it comes to billion-scale vector storage, indexing, and search. Milvus 2.0 implements this type of architecture - here's a link to our VLDB 2022 paper, if you're interested: https://arxiv.org/abs/2206.13843


Just having "moar disk!" ≠ "scalability." Because unlike running single-thread on a shard-per-core [or hyperthread] basis, aligning NUMA memory to those cores, etc., there's no way to make your storage "shared-nothing."

At ScyllaDB we've put years of non-trivial effort into IO scheduling to optimize it for large amounts of storage. You also need to consider the type of workload, because optimizing for reads, writes, or mixed workloads are all different beasties.

More here:

https://www.scylladb.com/2022/08/03/implementing-a-new-io-sc...


Why would search queries have contention with each other? Surely it's in the domain of embarrassingly parallel.


I agree in the typical case, but they support concurrent add/delete in one of their index options. Handling consistency/contention while modifying whatever graph/tree/etc. structure they are using is probably nontrivial, and the resulting cache invalidations would also likely affect the QPS.

P.S. Great work on your site, by the way - it's a really inspiring project!


Seems to me you can do that in a way that ensures low contention between consumers by using a read-biased MRSW lock. Such a construction isn't free, but it really shouldn't eat into your read performance all that much: you're adding hundreds of nanoseconds to your query time by acquiring and releasing a lock. Unless you're already serving millions of queries per second per thread, this is piss in the ocean.
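
For what it's worth, here's a minimal read-preferring MRSW lock in Python along those lines - a sketch, not production code:

    import threading

    class ReadPreferringRWLock:
        # Many concurrent readers, one exclusive writer.
        def __init__(self):
            self._readers = 0
            self._counter_lock = threading.Lock()  # guards the reader count
            self._writer_lock = threading.Lock()   # held while readers or a writer are active

        def acquire_read(self):
            with self._counter_lock:
                self._readers += 1
                if self._readers == 1:
                    self._writer_lock.acquire()    # first reader locks out writers

        def release_read(self):
            with self._counter_lock:
                self._readers -= 1
                if self._readers == 0:
                    self._writer_lock.release()    # last reader lets writers in

        def acquire_write(self):
            self._writer_lock.acquire()

        def release_write(self):
            self._writer_lock.release()

Queries wrap the index lookup in acquire_read/release_read, while add/delete takes the write side, so readers only ever wait when a writer is actually in the middle of an update.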


Disk IO needs concurrency. And you're back to Little's Law.
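
(Roughly: requests in flight = throughput × latency. So to sustain, say, 100k IOPS at ~1 ms per IO you need on the order of 100 requests outstanding at all times, which is exactly where "embarrassingly parallel" queries start contending for the disk.)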


I guess it's about downtime:

> With vertical scaling, pod capacities can be doubled for a live index with zero downtime. Pods are now available in different sizes — x1, x2, x4, and x8 — so you can start with the exact capacity you need and easily scale your index.


Horizontal scaling has been a feature of Pinecone for a while now.


This is developer evangelism at its best, at the behest of VCs, to scale and "productionalize". There are reasons why this problem is fundamentally difficult, and coming out of the blue claiming to have found a 10x solution in a field that has attracted lots of research interest is highly sus. I would love to see a study that actually exposes their methodology, replication from independent parties, and more importantly CODE.


If code and methodology are what you're looking for, there are some great open-source vector databases out there.

Milvus: https://github.com/milvus-io/milvus

Qdrant: https://github.com/qdrant/qdrant

Weaviate: https://github.com/semi-technologies/weaviate

Milvus seems to be the most advanced and best-performing vector DB (https://www.farfetchtechblog.com/en/blog/post/powering-ai-wi...). Haven't seen Qdrant benchmarks yet, but it's a cool project nonetheless.

FWIW, these open source projects are how I got into the area of vector similarity search to begin with.


Confused by their claim to be the 'first' vector database. These things have been around forever, haven't they? For example, FLANN (not a DB server, but an example lib) dates from 2009.


I think the difference is in the layer of abstraction, i.e. FLANN is just the underlying search functionality whereas vector databases are fully managed solutions. Even so, Weaviate came out in 2018, so saying they are the "first" vector database is just flat-out wrong, since Pinecone was founded in 2019.


Same difference as ElasticSearch and Lucene.

re: difference in layer of abstraction.


Weaviate calling themselves a vector database is a fairly new thing.


The fact that Weaviate only recently started calling themselves a vector database is completely irrelevant here. They had this type of vector data infrastructure before Pinecone did, and that's all that matters.

Example: I'm going to start a new company called Conifercone and do pretty much exactly what you do, but call it a "vector datastore" instead. Apparently I've now created the first ever vector datastore even though functionally I have done nothing novel.


I'm affiliated with Weaviate, so maybe nice to get this out here for the record :)

We have called Weaviate a "vector search engine" (we prefer that term because it describes the type of database) since around August 2020.

Github: https://github.com/semi-technologies/weaviate/tree/a3967aff5...

The reason was simple; our community started to say that the mixed vector and scalar filter search capabilities were what they liked most.

Also, our benchmarks have been available here for quite some time: https://weaviate.io/developers/weaviate/current/benchmarks/a...

They are based on ann-benchmarks.com but adjusted for full databases.


There’s also Vespa from Yahoo, which has been used at scale for years (decades?): https://docs.vespa.ai/search.html?q=Vector


Interesting connection considering that some of the Pinecone founders are ex-Yahoos.


Side tangent: Pinecone pods seem to cost 15.625% more per hour on AWS compared to GCP.

Managed services usually hide these underlying instance price differences away, so this is interesting to see.

Edit: also there is no free tier for Pinecone on AWS :(


10x faster with respect to what?


It was more than 10x slower than Vespa, Weaviate, or Qdrant before. Maybe they are finally getting close.


For anyone interested in the code walkthrough: https://www.pinecone.io/learn/testing-p2-collections-scaling...


Didn't Milvus (vector db, wrapper around FAISS) come before Pinecone?


Just to clarify, Milvus is much more than a wrapper around FAISS. Our vector search component called Knowhere (https://github.com/milvus-io/knowhere) utilizes FAISS and Annoy and will soon include ScaNN, DiskANN, and in-house vector indexes as well. Milvus uses Knowhere as the compute engine, and implements a variety of database functions such as horizontal scaling, caching, replication, failover, and object storage on top of Knowhere. If you're interested, I recommend checking out our architecture page (https://milvus.io/docs/architecture_overview.md).

[EDIT]: Forgot to mention - Milvus development began in 2018, and it was open sourced in 2019.


It's weird to see someone flex vertical scaling...


A custom datatype or FDW in PostgreSQL seems interesting to me.


There's one, but with some limitations (for example, only vectors of up to 1024 dimensions):

https://github.com/pgvector/pgvector
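
In case it's useful, here's a minimal sketch of what using it from Python looks like - the table and column names are made up, but the vector(n) type and the <-> L2 distance operator are pgvector's own:

    import psycopg2  # assumes PostgreSQL with the pgvector extension installed

    conn = psycopg2.connect("dbname=vectors")
    cur = conn.cursor()
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
    cur.execute("CREATE TABLE IF NOT EXISTS items (id serial PRIMARY KEY, embedding vector(3))")
    cur.execute("INSERT INTO items (embedding) VALUES ('[1,2,3]'), ('[4,5,6]')")
    # Nearest neighbours by L2 distance.
    cur.execute("SELECT id FROM items ORDER BY embedding <-> '[2,3,4]' LIMIT 5")
    print(cur.fetchall())
    conn.commit()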


[flagged]


Which open source libraries is pinecone wrapping?


I’m not sure where the other commenter gets their confidence, but Pinecone is not wrapping any open source vector-search library. We offer three index types (in-memory, in-memory graph-based, hybrid memory + disk), and all are proprietary.

We do have articles about Faiss and HNSW and all sorts of other vector-search and NLP topics, so it’s possible that’s where the confusion comes from.


So, how does your proprietary solution compare against FAISS, e.g. with 10M dense vectors of 1024 dimensions?


They used to publish some benchmarks on their site, but seem to have removed them. You can find them on archive.org[1]. I guess it is understandable, since vector search performance is pretty unpredictable, and depends on a lot of factors. If their target market is people who want vector search without needing to read a bunch of papers first, benchmarks might be more confusing than they are helpful.

edit: While I do think it's understandable, it's not great for transparency. Even if they don't want to open-source their index, I would admire it if they were willing to give ann-benchmarks[2] an API key to publish some independent results.

Disclaimer: I work on vector search at a different company

[1] https://web.archive.org/web/20210227105542/https://www.pinec... [2] https://github.com/erikbern/ann-benchmarks


FAISS


This is incorrect.


Strange comments and denial coming from the person who submitted a post about pinecone wrapping Faiss a year ago https://news.ycombinator.com/item?id=27502458


Yeah, what's this landing page about? https://www.pinecone.io/managed-faiss/


That's a very old and experimental page -- hence the "early access" mentions. Glad you pointed it out so we can delete it.


If you really think that's enough to build a real product, go for it. Even open-source companies (Elastic, Mongo, Scylla) have to build tons of infra around their core codebase in order to make it an actual cloud product.


Not that easy - the founder was a director at AWS. This is just devops/obfuscation on top of an open-source library:

FAISS


Pinecone doesn’t use Faiss, nor ScaNN. We love Faiss and even teach people to use it[1]. There happens to be a sizable population of engineers who need more than what Faiss provides (like live index updates and metadata filtering, for example), and can’t be bothered or aren’t being paid to customize and manage open-source libraries all day.

[1] https://www.pinecone.io/learn/faiss/


So you guys developed and implemented state-of-the-art neural network vector search from scratch? In a year? And something better than libraries with tens of contributors over years of research?


Most vector search research teams are a lot smaller than you suggest, and haven't been around that long (e.g. the FAISS paper was published in 2017).

From public info, you can see they have at least one researcher working there. It's believable to me that they could have some new innovations, especially since the product space they're focusing on is different from other teams working on vector search. State-of-the-art for a specific set of constraints is still state-of-the-art.

However, considering how much of their edu-marketing content is posted to HN, it would be great if they could share more details about the internals of their index with the community. One of the great things about vector search is how many techniques are open sourced or documented in papers :).

Disclaimer: I work on vector search at a different company


Many very competitive vector search libraries are done by small teams.

HNSW in NMSLIB[1] is mostly 3 people's work and it's very competitive[2].

[1] https://github.com/nmslib/nmslib

[2] http://ann-benchmarks.com/glove-100-angular_10_angular.html


I actually built a similar solution supporting similar operations (including filtering by meta-data) using open-source libraries. Took me about 2 weeks net.

I can see a clientele for such a database (people who want a turnkey solution), but honestly it looks like an attempt to use a dev-ops solution to address deeper issues with problem formulation, e.g.:

1. Is there really a need to search all items in the database? can subsampling make simple similarity comparison feasible?

2. Do the embeddings really need to have that many dimensions? Can we reduce their dimensionality and fit them in RAM?

3. Is embedding accurate enough compared to pairwise comparison? Can we formulate the problem to make the latter feasible?

I also could not find any explanation of the underlying algorithms - especially around metadata filtering, which FAISS doesn't solve either - or of their accuracy. (Happy to hear otherwise.)
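
To give a feel for what I mean, here's roughly the shape of such a DIY pipeline sketched with FAISS - the random data, the "lang" metadata field, and the naive over-fetch-then-post-filter step are all just illustrative:

    import numpy as np
    import faiss  # assumes faiss-cpu is installed

    d_in, d_out, n = 1024, 128, 100_000
    xb = np.random.rand(n, d_in).astype("float32")        # stand-in for real embeddings
    meta = [{"lang": "en" if i % 2 else "de"} for i in range(n)]

    pca = faiss.PCAMatrix(d_in, d_out)                     # question 2: shrink the vectors
    pca.train(xb)
    index = faiss.IndexFlatL2(d_out)
    index.add(pca.apply_py(xb))

    def search(query, k=10, lang="en"):
        q = pca.apply_py(query.reshape(1, -1).astype("float32"))
        dists, ids = index.search(q, k * 10)               # over-fetch...
        hits = [i for i in ids[0] if i != -1 and meta[i]["lang"] == lang]
        return hits[:k]                                    # ...then post-filter on metadata

Post-filtering like this is fine for mild filters; stricter filters need filter-aware indexing, which is exactly the part FAISS doesn't give you.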



