Every database will become a vector database sooner or later (nextword.substack.com)
232 points by nextworddev on Oct 3, 2023 | 138 comments



I think the move towards vector databases might be more hype than necessity. Traditional databases, when properly optimized, can handle vector data for many use cases. The push for specialized vector databases could be re-evaluated in terms of efficiency and cost-effectiveness compared to optimizing existing scalar databases.


Well you could store numbers all fine, but indexing vectors for similarity queries seems fairly recent and not all that widespread in the transactional world.

As traditional DBs move forward in this space, the need for dedicated vector databases will likely shrink, except for some very specific implementations that offer unique enough features (e.g. deeplake does vector search over object storage, which is very convenient for certain specific scenarios).


How is indexing a vector different from indexing a varchar or an integer? If you convert a vector into a byte array, it should be no different from the byte array of a varchar, except for the contents.

Now if you want to do similarity search, you have to measure the distance between 2 or more vectors, and that's independent of the indexing. No?

So any database with sufficient memory should be able to accomplish this, as evidenced by the vector similarity search feature of Redis. (I don't know how the Redis folks have implemented vector similarity, but they do support KNN search.)
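
For what it's worth, the usage side looks roughly like this (a sketch with redis-py against a Redis Stack instance with the search module loaded; FLAT is the brute-force option, HNSW the approximate one):

    import numpy as np
    import redis

    r = redis.Redis()  # assumes a local Redis Stack instance

    # Index with a brute-force (FLAT) vector field; "HNSW" is the approximate option.
    r.execute_command(
        "FT.CREATE", "idx", "ON", "HASH", "PREFIX", "1", "doc:",
        "SCHEMA", "vec", "VECTOR", "FLAT", "6",
        "TYPE", "FLOAT32", "DIM", "4", "DISTANCE_METRIC", "L2",
    )

    # Vectors are stored as raw float32 bytes in a hash field.
    r.hset("doc:1", mapping={"vec": np.array([.1, .2, .3, .4], np.float32).tobytes()})

    # KNN query: the 3 nearest neighbours of a query vector.
    q = np.array([.1, .2, .3, .5], np.float32).tobytes()
    print(r.execute_command(
        "FT.SEARCH", "idx", "*=>[KNN 3 @vec $q]",
        "PARAMS", "2", "q", q, "DIALECT", "2",
    ))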


Mostly the number of dimensions. Assuming your vectors are float16, so 2 bytes per element, you'd run into Postgres' B-tree index tuple limit (2704 bytes, i.e. 1352 float16 dimensions) very quickly. You could index a 512-dimension vector fine, but I believe most models are well beyond that.

There are alternative index types, of course, or you could index the hash of the vector. These both come with tradeoffs.


Btree isn't a very useful index type for a vector, though. GIN, GiST, and the handful of new extensions optimizing for vector search are what you'd want (and they don't have this limitation).

Aside: you can increase the size of tuples you can index in a PostgreSQL btree by increasing the page size (requires a recompile and creating a new database instance).
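
For reference, a rough sketch of the extension route with pgvector (assuming the extension is available; ivfflat shown here, newer versions also ship HNSW):

    import psycopg2

    conn = psycopg2.connect("dbname=test")  # hypothetical connection string
    cur = conn.cursor()

    cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
    cur.execute("CREATE TABLE items (id bigserial PRIMARY KEY, embedding vector(3));")
    # ivfflat index on the vector column; the btree tuple limit doesn't apply here.
    cur.execute("CREATE INDEX ON items USING ivfflat (embedding vector_l2_ops) WITH (lists = 100);")

    cur.execute("INSERT INTO items (embedding) VALUES ('[1,2,3]');")
    # Nearest neighbours by L2 distance via the <-> operator.
    cur.execute("SELECT id FROM items ORDER BY embedding <-> '[2,3,4]' LIMIT 5;")
    print(cur.fetchall())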


Agreed on the first, but you have to first know that those exist (and what they're good for). This leads into my second point: IME, the Venn diagram of "people making AI stuff" and "people capable of compiling and running their own DB in a reliable manner" has no overlap.


Indexing in a vector engine is what gives you similarity search faster than brute force. The type of engine is what gives you various different distance measures (often approximate). Redis specifically has two choices: brute force (which is precise and slow), or HNSW (which is approximate and fast, but space-consuming for interesting dimensionality).


The distance computation could be separate from the indexing, but it will be inefficient relative to having an index organized to support the task.


sqlite has r-trees for instance [0]. Could it be good enough for most use cases? If it's to query a knowledge base for instance, a couple dimensions should be sufficient. With the added benefit of being able to query your data in other ways.

[0] https://www.sqlite.org/rtree.html
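
A tiny sketch of the r-tree API via Python's bundled sqlite3 module (the RTREE module is compiled into most SQLite builds):

    import sqlite3

    db = sqlite3.connect(":memory:")
    # A 2-d r-tree: id plus (min, max) per dimension.
    db.execute("CREATE VIRTUAL TABLE doc_index USING rtree(id, minX, maxX, minY, maxY);")
    # Points are stored as zero-area boxes (min == max).
    db.execute("INSERT INTO doc_index VALUES (1, 0.3, 0.3, 0.7, 0.7);")
    # Range query: everything whose box falls inside the query window.
    print(db.execute(
        "SELECT id FROM doc_index WHERE minX >= 0.2 AND maxX <= 0.8 AND minY >= 0.5 AND maxY <= 0.9"
    ).fetchall())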


r*-trees don't work well when the number of dimensions stored in the index is much higher than the logarithm of the number of indexed entries, and this is a prevailing property of divide-and-conquer spatial index types where the keyspace is divided based on a single dimension at a time. As vectors regularly have 100+ dimensions, normal spatial indexing methods applied to vectors wouldn't be very efficient for anything with much less than 2^100 index entries, which is quite suboptimal for most datasets that you would want to have indexed.


Also the distance metric for r*-trees is just plain wacky for anything other than low-dimensional Euclidean space.

Even if you could make it perform well, it would not do what you want.


Are you saying this because r-trees expect a proper metric space, and people have the need to index datasets over non-metric spaces?


The curse of dimensionality creates a seemingly paradoxical situation: you have a vast search space, yet everything is incredibly close to everything else. Space-subdivision algorithms become ineffective.
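
You can see it empirically with a few lines of numpy: as the dimension grows, the gap between the nearest and the farthest random point collapses.

    import numpy as np

    rng = np.random.default_rng(0)
    for d in (2, 10, 100, 1000):
        points = rng.random((10_000, d))
        dists = np.linalg.norm(points - rng.random(d), axis=1)
        # A ratio near 1 means "everything is roughly equally far away".
        print(d, dists.min() / dists.max())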


Here is a SQLite extension that uses Faiss under the hood.

https://github.com/asg017/sqlite-vss

Not associated with the project, just love SQLite and find it very useful.


What is "vector search over object storage?" Does deeplake performs some computations on objects and search on their embeddings?


It stores everything on cheap storage with no compute attached (i.e. S3), and uses the client to compute the query embeddings, retrieve the embedding index, and run an indexed search to identify the data to be retrieved; likewise, the client does the work of updating the index structure on writes.

The benefit is that you don't have to pay for the compute part of a database, and the storage layer is as cheap as it could be on the cloud.


At the expense of latency? When in fact latency is the most important aspect of any search. Any idea how fast the searches from the client are?

> retrieve the embedding index, and run an indexed search to identify the data to be retrieved

Please bear with the layman-like questioning:

So if the data is {"obj": "obj1", "data": {"name": "atlas", "embedding": "1123124234"}}, what is an embedding index? Is it something like {"1123124234": "obj1"}?

From what I understand, the query will be "geography", whose embedding will be "12311111", and now you have to run a KNN for a match, which will return {"name": "atlas", "embedding": "1123124234"}.

Not sure where the embedding index comes into play here.


Eh, sure, latency is suboptimal. But if you have an LLM in the mix, that latency will dominate the overall response time. At that point you might not care how performant your index is, and since performance/cost is nonlinear, it can translate to very significant savings.


To be fair, "vector database" does sound more official, as "new and important technology", compared to the last DB hype of NoSQL.


> compared to the last db hype of NOSQL

NoSQL has been around for over 20 years.

Since then, Cassandra, DynamoDB, FoundationDB, MongoDB, Neo4j, Redis, etc. are not only still around but widely used, powering many of the services you use today.


That's true, but 13 or 14 years ago, the NoSQL hype was calling for widespread abandonment of RDBMSs. They were claiming key-value stores and document databases would replace traditional relational databases the same way Reddit replaced Digg. That "MongoDB is web scale" video was lampooning the very real hype that people had, and how difficult it was to pick through it. It was as bad as the more recent blockchain hype.

Even looking at HN's reactions to that video shows a few comments that did not age very well (although most of them did): https://news.ycombinator.com/item?id=1636198

The sensible takes were that NoSQL would supplement and enhance RDBMSs, but the hype was much more than that.


The difference is that today those DBs are generally considered as complements to SQL databases targeting specific use cases. Back in the peak NoSQL days people were pushing the narrative that you'd never need to use a SQL database for anything ever again.


IMO this is the deal:

RDBMS is such a mature and powerful technology, and with the vast power in modern single-node hardware, they will scale to considerable sizes. But there is a limit.

Once a table or set of tables hits a certain scale, it needs to be distributed. Once you get to those scales, you are likely dealing with the threat of exponential data growth, so "distributed Postgres" will bandaid the problem... but you are starting to run into the CAP theorem's problems.

You'll need to AP-scale the biggest data tables, and since that almost always means a big coding-change lift and the introduction of an AP-scaling database, it will almost always be a six-month to one-year transition, possibly adding an entirely new DB technology (Cassandra / DynamoDB / maybe FoundationDB).

I have yet to have anyone explain how joins scale on an AP distributed database except in limited situations where the joined data is somehow node-local to the other tables, usually some hierarchical situation. Otherwise you are pulling data from lots of nodes and aggregating and comparing the different sets to account for node drift / partitions / network failures. Cassandra and Dynamo basically say "you are scaling a single table/query/update/pre-joined data table".

Which really isn't fun for RDBMS folks. Because it is a shitload of denormalization on top of all the AP headaches and distributed transactions / updates.

As I said, megahuge machines really make that an outside case, unless your use case really emphasizes the "A" in CAP, where low write speed is tolerable and you want to survive entire cloud or datacenter outages.

But yeah, nosql was never going to kill the rdbms. The functionality/power of rdbms/SQL is so so so so much higher. Just be aware when you're about to shoot over the limit and prepare for the code changes to handle true scale in the (unlikely) event you're going to need it.


Great explanation, and I agree, with one caveat: from my experience, if you get to a truly large single RDBMS you have already lost, even if you have a powerful enough machine to run it on.

Massive RDBMS instances are awful to manage. Hard to backup/restore/fork. Hard to migrate tables and schemas without causing downtime. Accidental downtime happens all the time due to locks, bad indexes, bad query plans, etc. At large scale they are capricious beasts; care and feeding, and most importantly changing them, becomes a dark art.

Don’t let them get too big you’ll regret it :)


Oh, they're still at it. In my relatively recent experience Mongo will still try to sell you on the idea of abandoning relational workloads because migrations are hard and schemas are optional.


I mean, NoSQL was hype with no substance, but "you can scale more if you deal with not having ACID" is just generally true.

Of course, ACID scales well into the Fortune 500 scale, so...


"No substance" seems a bit harsh.

They mostly seem like a tarted-up associative array, sure, but a key-value store is a thing.


A naturally distributed key/value store. NoSQL wasn't a product, it was a radical rethinking of the balance between what we wanted and what we needed. Turns out some absolute necessities of using a RDBMS or even using SQL are not that important, and relaxing those actually-not-requirements allows massive scalability, something we desperately needed, or some evolved data structures and computation, like what redis provides.


> but a key-value store is a thing

DynamoDB underpins much of AWS which in turn underpins a ridiculous number of web services.

So definitely more than just a thing.


But don't you prefer your key value stores to be wearing red lipstick and a pushup bra?


What would you use to compute proximity of vectors, for example?


The argument for combining the traditional database and the vector database into one because it reduces data movement doesn't compute for me.

Firstly, even for non-vector data, a read/write transactional database and a read-optimized store purely for fast serving are already markedly different. Then, the shape of the data used to generate embeddings is markedly different from the shape of the data that is ready to transact or serve.

So, no matter where it is stored, it has to leave that store, get transformed and enriched and then run through an embeddings generator (ML inference).

Then, it has to be stored in a manner that is optimal for retrieval ranking. If you are doing ANN that's one thing, but if you are doing attribute based filtering while retrieving and you wish to accelerate it through GPU to do more exhaustive search, that's another thing altogether.

All of this leads to fairly sophisticated, optimized implementations. Sure, a singular database product that has all these different optimized engines can emerge over time, but surely it is too early today to converge like this.


The problem is the maintenance of a separate database; operationally it would be more work. For those of us who want to use embedding features, having a separate database just for the embeddings means we now have to keep data across multiple databases in sync, and maintain them all.


> So, no matter where it is stored, it has to leave that store, get transformed and enriched and then run through an embeddings generator (ML inference).

Same for pretty much all the data in your database. It comes from an app that does validation/editing/transforms. The benefit is that it's all together, can be atomic, and only requires one query to get, update, or delete.

Vector databases are just normal databases with a vector index. There's no reason for you to have a specialized DB for it.

Also, embeddings aren't inference, it's a token lookup. There's no forward pass.


The one-DB-fits-all approach only works when the size of the database is really small and never grows. Imagine you have 100 customers. Each customer generates, on average, a million 1536-dimension vector embeddings (considering OpenAI Ada dimensions, which is the most popular right now). That is 6GB (1536 dimensions x 4 bytes per f32 dimension x 1,000,000 vectors) of just embeddings PER CUSTOMER. If you use HNSW, it will take at least that much RAM, if not more. If you use PQ (and variants) you can reduce the size of the index in RAM to, say, 512MB-1GB per customer. That is still quite a lot of memory. That is just the way it is and there is no way around it.

Now imagine you are using that database for storing transactions and other day-to-day business ops: still millions of records, but with small indexes. This would ideally have required only a single DB instance with a replica for redundancy. If you bring vectors into the equation, you will have to needlessly scale this DB both horizontally and vertically just to maintain decent query/write performance (which would have been extremely fast without embeddings in the mix). You will eventually separate the embeddings out, as it makes no sense for the entire DB to be scaled just for the sake of scaling your embeddings. And I am not even accounting for index generation for these vectors, which will require nearly 100% of all CPU cores while the index is being built (depending on the type of ANN you are using) and which in turn will slow your DB to a crawl.


Exactly - vector indexes are so different from traditional RDBMS B-tree or LSM-tree indexes that it doesn't make sense to use the same store for both unless it's basically a toy app.

Someone makes this point in another comment, but it's analogous to OLTP vs OLAP.


I don't even want to imagine the workload of high-txn OLTP mixed with an OLAP access pattern. IMHO, if you can get away with that, you don't need OLAP in the first place.


My experience is that if you do the data modelling properly, a well-designed star schema with some aggregation tables or materialized views on top can often remove the need for dedicated OLAP software.

Now you do NOT want to run such a setup on the same hardware that you use for your transactional systems, of course. But you CAN use the same software (like Oracle), which means that you do get some reduction in tech complexity.


Are there any DBs that could support both use cases while partitioning them in such a way that the transactions etc. are kept only on the part of the resources they need? Basically two separate DBs, but sharing the same interfaces, security, etc.


What you are talking about is possible in regular SQL DBs with extensions. However, when it comes to scaling, traditional DBs don't have the necessary tools to do so automatically. Most extensions provide support for the underlying ANN algorithm they implement, and there's that and nothing more. Everything else you'll have to hand-roll yourself.

Clustering, load balancing, aggregating queries, etc. are quite different for a vector database in comparison to traditional OLTP databases.

It's the same as the difference between OLAP and OLTP: both have underlying architectural differences which make it incompatible for them to run in an integrated fashion.

For instance, in a traditional DB the index is maintained and rebuilt alongside data storage, and for scaling you can separate it into read/write nodes. The write nodes typically focus on building indexes, while the read nodes handle querying eventually consistent indexes (eventual consistency is achieved by broadcasting only the changed rows rather than sending the entire index).

It's similar in vector DBs too. You can separate the indexer from the query nodes (which access an eventually consistent index). However, the load is way higher than in a regular DB, as the index is humongous and takes a long time to build, and sharing the index with the query nodes is also more time-consuming and resource/network intensive, since you won't be sharing a few rows but the entire index itself. It requires a totally different strategy to get all query nodes eventually consistent.

The only advantage of traditional DBs implementing vector extensions is familiarity for the end user: if you are already familiar with Postgres, you won't want to leave your comfort zone. However, scaling a traditional DB is different from scaling a vector DB, and you'll encounter those pain points only in production, at which point you'll be forced to switch to a proper vector database anyway.


We (PlanetScale) announced vector storage and search today. If I am understanding your request, it sounds like something we could do. I would love to hear more if you are willing to chat: s@planetscale.com


Absolutely - this is par for the course for distributed engines, just not for Postgres and the other single-node engines which a lot of people here will tell you are all you need. What they mean is that's all you need until you either have to introduce enormously complex application-tier sharding or move to a scalable engine.


Supabase has the pgvector extension and that's enough for my limited RAG use cases; I don't really need anything beyond Postgres. On the other hand, an enterprise might find it easier/cheaper to buy a second DB than to migrate their existing DB to whatever the latest version is. I don't think it's that simple.


Exactly. We use Supabase too, but are at a scale where it just made sense to use a second, dedicated vector DB (Pinecone) rather than bloat our Postgres DB, which has a completely different workload.


Bloat your DB... or pull in an entirely new vendor and bloat your entire operational outlay.

I'd really love to know what kind of insane scale justifies that tradeoff...


That's a very exaggerated way to look at things, lol. Nothing got bloated at all in this process; we are just using the right tools for the job. I'm a solo founder and the only backend developer. I can assure you this decision only made my life easier, by choosing the correct tech from the get-go.


Nothing exaggerated - your comment implied scale was your justification, in which case there'd better have been some crazy-high load that brought the tool you already had to its knees, to justify paying for an additional closed-source platform and manually piping data to it in addition to your main data store.

Of course if I sounded incredulous it's because I didn't think you had that scale, and it sounds like I was correct?


No, you're not correct. We have over $2M ARR, and with the amount of data we are storing it would be downright stupid to use Supabase for it.

We also don't "pipe" our data to Supabase; we use a couple of different data stores depending on the best use case. For example, we also use R2 and Durable Objects.

Just because you have a hammer doesn't mean everything is a nail.


Well maybe we have different definitions of scale: I think my team spends about $2M a month on compute, so we don't pride ourselves on randomly pulling in new vendors.


You are incredibly dense


When you're playing checkers people playing by the rules of chess might seem dense.


It's pretty incredible how you know more about our needs, usage, and infrastructure than we do :D

Also don't you think it's funny calling out somebody else's tech choices when you have zero insight into it and when it's worked out perfectly for us?

By the way, how many TBs of vector data are you storing in Postgres and needing to retrieve with minimal latency?


I work on autonomous vehicles: we generate more data every day than you likely generate in 10 lifetimes of your CRUD app escapades.


Bragging about generating lots of data? Wow, cool buddy

Literally irrelevant lol


Don't forget your latency!


They are separate systems. We don't touch Postgres for the same code that needs to access Pinecone.


How do you deal with security and access control across postgres and pinecone?


We use Cloudflare Workers for our API and just handle auth calls by checking the JWTs with Supabase and caching it. So we already had the necessary auth setup to do this.

For basic CRUD we use the Supabase endpoints directly but none of that involves querying a vector db :P


Recently ditched Supabase for Weaviate. I was tired of the Python bindings not keeping up, no hybrid search, slower search algorithms. Also, Supabase has a lot of features that I just don't need.


As a lover of array languages, I remember being excited to read a futurist article on vector processors and programming languages. It was written right before Wes McKinney worked on Pandas (the J programming language influenced him), and I thought J/APL or another array language was going to explode. J has Jd, in which J is fully integrated. This did not come to pass (yet). No matter, I still enjoy array languages anyway. There's a new array language, uiua[1], that is a mix of array and stack concepts with a good standard library including audio and graphics.

[1] https://www.uiua.org/


uiua looks like a perl programmer went mad


It looks like array/stack Brainfuck.

Previous uiua discussion: https://news.ycombinator.com/item?id=37673127

Brainfuck: https://en.wikipedia.org/wiki/Brainfuck


Appearances can be deceiving. In terms of expressiveness, Uiua is to Brainfuck as Python is to nand.


Exactly this. J and Uiua, along with APL, are very expressive for me. It doesn't take long to become comfortable with the glyphs. Sad to say, BQN's approach resonated with me; however, some subjective bias in me didn't feel warm and fuzzy about the glyphs. And that's saying a lot, since I am good with J's ASCII noise, APL's glyphs, and now Uiua's glyphs. Sorry, Marshall! BQN is incredible. Maybe I had some sort of PTSD from my days at the Brooklyn Museum and some weird confluence of hieroglyphics and BQN glyphs!


It is true that every major DB vendor, SQL or not, is smashing the AI/vector keyword on their front pages. In Elastic, for example, vector capabilities have gone from laughable to respectable in a year. It's a lot simpler to just use one DB instead of many.

But a question for true DB experts here:

1. Is there any real advantage to building a dedicated vector DB from scratch?

2. Is vector search something that can just be 'tacked on' to a normal DB with no major performance penalties?

We know from history, that data warehouses are genuinely different from databases, and cloud data warehouses are overwhelmingly superior to on-prem ones. So that emerged as a distinct, enduring category with Snowflake/Databricks/Bigquery.


Data warehouses are columnar stores. They are very different from row-oriented databases like Postgres and MySQL. Operations on columns - e.g. aggregations (mean of a column) - are very efficient.

Most vector databases use one of a few vector indexing libraries - FAISS, hnswlib, and ScaNN (Google only) are popular. The newer vector DBs, like Weaviate, have introduced their own indexes, but I haven't seen any performance difference.

Reference: https://ann-benchmarks.com/
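
For a sense of what these libraries expose, a minimal hnswlib sketch (random data standing in for embeddings):

    import hnswlib
    import numpy as np

    dim = 768
    data = np.float32(np.random.random((10_000, dim)))

    index = hnswlib.Index(space="cosine", dim=dim)
    index.init_index(max_elements=10_000, ef_construction=200, M=16)
    index.add_items(data, np.arange(10_000))

    index.set_ef(50)  # query-time recall/latency knob
    labels, distances = index.knn_query(data[:1], k=10)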


Elastic does a great job with it. My one-person company builds software that mirrors emails from IMAP accounts to ElasticSearch, and adding vector search on top of that data to "chat" with emails was fairly simple. I was expecting an untold number of hurdles, but the only requirement was having at least v8.8.0 of ElasticSearch (this was when they increased the supported vector sizes so that OpenAI embeddings would fit), and that's it. https://docs.emailengine.app/chat-with-emails-using-emaileng...


Single-node database systems that are not horizontally scalable and not built on a distributed-system foundation (e.g. Postgres) will certainly hit scaling bottlenecks if you just add more and more complexity to the workload. However, many modern database systems are built on a distributed-system foundation with horizontal scaling and the ability to independently scale different constituent parts of the backend; these engines should have no problem.


The trade-off that you are interested in isn't about storing vectors, but rather about whether an index should be part of the DBMS or external to it.

Some advantages of a separate index are that it can work with different backends, it can be independently scaled, and it can index data for more than one database server.

Some disadvantages are increased latency, increased complexity, and distributed system problems.


After spending the last six months working with a vector database, I qualified Postgres with its vector extensions this morning, and I am trying to toss out everything else.

The operational pains if you need to self-host this stuff are real: split brain, backup/restore not really considered (compared to a normal database's features), and things like replication and sharding _exist_ but are often a buggy mess.

OLAP is definitely distinct from OLTP, and most of these vector queries have some aspects of both - they are similar to OLAP in that they need a decent amount of preprocessing to be useful (inference), and similar to OLTP in that they are often used for serving point queries or tiny lookups.


> It genuinely makes sense for incumbent database players to offer vector search, because that eliminates unnecessary data movement to separate vector databases. Co-locating vectors and original documents also reduces latency.

Yet OLAP databases continue to thrive alongside OLTP databases, the nascence of NewSQL hybrid (HTAP) databases notwithstanding. Different needs dictate different design choices for optimality.


> Different needs dictate different design choices for optimality.

Could not agree more. Even for time series, which could be seen as a subset of OLAP, trade-offs and design choices specific to time-series data are necessary. As an example, in a TSDB that I know well, QuestDB: data is always ordered by time once it lands on disk, the data is partitioned by time, and the ingestion protocol is designed to stream large volumes of data, which can be either continuous or in bursts.


It’s interesting that I never considered why OLTP and OLAP are basically orthogonal technologies. Are there any major players that have an integrated solution for both?

I guess it makes sense because the infra is so different, but I’m not sure whether it need be.


I think they have; they are just not that well known. E.g. SQL Server: https://learn.microsoft.com/en-us/sql/relational-databases/i... - you can also find quite a lot of papers by Microsoft employees on the designs and capabilities (starting around 2016 I believe, so "pretty new"). I have used it with TPC-H and it worked wonders, though I never got around to using it in a production workload.


As far as the big players are concerned, Google offers AlloyDB (https://cloud.google.com/alloydb) while Amazon offers Aurora (https://aws.amazon.com/rds/aurora/)


I had understood Aurora as just cloud-native MySQL / PostgreSQL

How does that relate to the OLTP vs OLAP dimension? Are they not both primarily OLTP dbs still?


Efficient OLAP queries need a different shape of data - some combination of columnar storage for efficient scanning, and roll-up tables with pre-aggregated measures. Even in an integrated scenario, behind the scenes there will need to be a bunch of copying to transpose and/or refresh roll-ups.


Pre-aggregation is the step too many try to skip. People seem to think they can build a single schema to rule all things, and then assume they can quickly calculate any aggregation on demand.


Is anyone considering a new OLAP system these days? If "NewSQL" (which seems to be a fancy buzzword for running analytics in your transactional database) takes off, won't it be the final nail in the coffin for OLAP?


"NewSQL" is not about analytical performance, it's about multimaster distributed writes and reads. EG, you don't want to run count(*) on spanner.

"HTAP" is the buzzword you're looking for. It's promising, but also complex and nascent. It'll be interesting to see how much traction it gets over time, but things like TiDB and Unistore are pretty early on to call a nail in the coffin for redshift/bigquery/clickhouse etc.


I think the opposite concept, "reverse ETL," is actually more popular. You put everything into your data warehouse and then pump whatever you need out from there.


Not if you expect to do a bunch of data transformation before you materialize it into one of these engines. If the way you want to materialize it is materially different from the way you handle it transactionally, you're going to be moving it anyway, and if you're moving it, why not put it into an optimized context?


Implementing vector DB architecture isn't complex (so I'm bearish on commercial vector DBs as a net value add, especially since they all wrap some OSS ANN solution).

You pair a vector DB with a metadata store (can be anything, but ideally you want low gravity between the vdb and disk for fast retrieval... i.e. leveldb, sqlite equivalents... or hell, a traditional DB - and the author is right, traditional DBs don't need to work too hard to create an ANN extension).

In general, the greater challenge is the data engineering and the overhead of managing retrieval stores (in terms of the data-integration/model pipelines)... so I'm bullish on the solutions addressing opportunities here.


Sure, some companies will use it. Other companies will continue to use specialised focused tools.

It's why data engineering is a thing in our industry. We move and prepare data for a set of tools, and we pay good money to do so, because we believe we derive value from those tools.

Let's say MySQL offers it; anyone already using MySQL is likely to fence off the MySQL instance(s) focused on vector stuff for various reasons (resilience, different read/write patterns, security, etc.).

MySQL as the (imaginary) basis only offers some transferable skills, because this DB will require different care and feeding.

Like the difference between Postgres and PG with cstore_fdw, similar, but sufficiently different.


I think that, just as with full-text search, vector search, if supported, will be full of tradeoffs in general-purpose databases.

The view that everything needs to support direct input for generative AI is short-sighted. There are other use cases as well, even if ultimately these will become just building blocks for whatever AGI there comes. Horses for courses.


I just googled whether MariaDB offers vector search, and the first hit is a Stack Overflow question from me in 2014. If they had jumped on it then, they could have been ahead of this AI business, but noooo...


Just received an email from PlanetScale about their fork of MySQL to support vectors:

> PlanetScale has forked MySQL to add vector storage and search! You’ll be able to support your AI and ML applications with the world’s most scalable database platform. This unifies the reliability and functionality of MySQL with the ability to store vectors and perform similarity search.


Maybe I’ve been using PostgreSQL too long but when faced with the choice of adding vector support to PostgreSQL or using a new technology, my first choice was to start with the PostgreSQL addition.

I’m not criticizing the specialized case for a true vector database, but for most workloads I agree that the big database players will be the right choice for many users.


I'm in the same boat. For things that aren't huge scale, it's almost easier to find an extension or otherwise beat Postgres (or SQLite or Percona MySQL) into submission for your use case. Timescale is a really good example... I was really impressed by how good the performance was for biggish (1 TB+) real-time scientific time-series data, even on a cheap Amazon Lightsail instance.


Pgvector's 2000-dimension limit on vector indices is annoying. There are workloads I want to push at it which are in the 5000+ range, so there's an extra dimensionality reduction step I need to build.

My starting point for pretty much any storage problem is "have you tried throwing it in postgresql" but this is one reason pulling something else off the shelf might be a good idea.
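
For the reduction step, something like PCA via scikit-learn works (a sketch; the 5000-dim input is a stand-in, and you'd fit once and reuse the same transform for inserts and queries):

    import numpy as np
    from sklearn.decomposition import PCA

    embeddings = np.random.rand(10_000, 5000)  # stand-in for real 5000-dim embeddings

    # Fit once on a representative sample, then reuse the same transform for every
    # insert and query so all vectors land in the same reduced space.
    pca = PCA(n_components=1536)
    reduced = pca.fit_transform(embeddings)
    print(reduced.shape)  # (10000, 1536) -- now under pgvector's index limit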


I am super ignorant about vector databases, but don't they store n-dimensional vectors, and aren't they used to answer geometric(ish) problems, getting the k closest entries to a particular point?

This sounds like a problem more similar to what 3D physics engines solve, but generalized to higher dimensions, as opposed to traditional text- and key-based database stuff.

The algorithms and data structures in physics engines (bounding volume hierarchies, kd-trees, etc.) are quite different from how a traditional database index (B-trees and skip lists are popular there, if I remember correctly) is searched and stored.


You are right about this, but in this case a vector would represent a collection of words. Getting the k closest entries would show the k closest text entries to a given query. This is super interesting for "semantic" search, where you are looking for meaning as opposed to just a textual match.

For example, the text "chocolate milk" is all the same characters as "milk chocolate", but the two likely have very different usage within the context of retailers or cooks. So, their vectors should be very different.

Word2vec is an NLP technique that uses a neural net to build these vectors: https://en.wikipedia.org/wiki/Word2vec
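
A toy sketch of training such vectors with gensim's Word2Vec (4.x API; a real corpus would be far larger):

    from gensim.models import Word2Vec

    sentences = [
        ["pour", "the", "chocolate", "milk"],
        ["a", "bar", "of", "milk", "chocolate"],
        # a real corpus would have many thousands of sentences
    ]
    model = Word2Vec(sentences, vector_size=50, window=3, min_count=1)

    vec = model.wv["chocolate"]           # the learned 50-dim word vector
    print(model.wv.most_similar("milk"))  # neighbours by cosine similarity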


Yeah, they do store n-dimensional vectors, but they're part of the LLM zeitgeist. The basic idea is that given a particular corpus of text, an LLM can produce a very high-dimensional vector that encodes the features of that text, according to its training parameters. Vectors that are near each other in the high-dimensional space represent pieces of text which the LLM "thinks" are similar to each other.

So, by using databases which can efficiently answer geometric nearest-neighbor questions, you can quickly search for chunks of text that are similar to each other.


I do agree with the article that this feature will be more or less available in all DB types.

Vector databases are a gimmick at the moment. Ultimately, conversational AI agents should be able to extract information from a diverse set of sources with a diverse set of tools. The approach currently taken is hit-and-miss at best. How often have you searched for something and the first result happened to be the thing you were looking for? Why should it be any different with vector DBs? Obviously the query matters a lot, no matter how the information is searched.


Hard disagree. Extracting information is much more costly (fetch data, feed data (which might be huge) into the model).

Embeddings work really well to store semantic meaning and are great for searching - or, at least, as a first stage of searching to filter out the non-relevant content.

I'm working on my own "notes" app based on embeddings, because I'm tired of never finding what I need due to bad search/tagging/categorizing.


> How often have you searched for something and the first result happened to be the thing you were looking for?

Searched where? Every search powered by a large tech company has almost certainly been using vector search for years. Then combined that with other non-vector results. Then run that through numerous ranking models. Then showed that to you. It's far from perfect but absurdly better than the average ElasticSearch results you might get elsewhere.


The benefit of using a specialized vendor like Pinecone is that they offer a combination of performance (clustering/load balancing), fast/effective algorithms, and data storage.


This article is extremely correct and true, bordering on obvious. Vectors are a feature of a database engine that all engines will eventually offer -- not a new category of databases.


Would you say the same about graph databases? (e.g., Neo4j, ArangoDB, Neptune)


What's interesting about graph is that it's really an ad hoc analytics use case... it's not for operational/transactional work. This is what most people don't realize. If you have an at-scale graph workload, for example if you're Facebook, you build your graph on top of an operational/transactional backend - in Facebook's case, heavily customized MySQL.

I bring this up because the ad hoc analytical use case for graph stores is so niche that most engines haven't seen enough demand to introduce it, since you can always store graph relationships in those engines and offer retrievals to a limited depth, which is typically sufficient for most operational/transactional use cases.


every 'Every X will become a Y sooner or later' title will become 'Every X becomes a monoid sooner or later' sooner or later


There is a great write-up by Jonathan Ellis, Apache Cassandra committer and DataStax co-founder, on hard problems with vector search: https://thenewstack.io/5-hard-problems-in-vector-search-and-...


I think the only thing that can tell us whether or not every database can be a vector database is time. We saw with time-series databases that they are their own unique type of database, and I believe this will be the same with vector databases.


> I think the only thing that can tell us whether or not every database can be a vector database is time.

I can help you out right now. You don't even need a database to have a "vector database". FAISS is an in-memory "vector database" that runs on the data that you happen to have. So: if your data is stored in a .txt file, load it into memory, index it with FAISS... bam, a vector database from a data file.

Can we take an arbitrary "real" database and implement KNN search on top of 1000 indexed columns? I'm sure it is possible - but I'm also pretty sure most databases will die under the pressure (source: I've asked some of my favorite DBAs if I could do this and they said "no").
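
To make the FAISS point concrete, a minimal sketch (exact brute-force IndexFlatL2; random vectors stand in for the embeddings of your .txt lines):

    import faiss
    import numpy as np

    d = 384  # embedding dimensionality (model-dependent)
    vectors = np.random.rand(100_000, d).astype("float32")  # stand-in for embedded text

    index = faiss.IndexFlatL2(d)  # exact brute-force search, no training step
    index.add(vectors)

    query = np.random.rand(1, d).astype("float32")
    distances, ids = index.search(query, 10)  # the 10 nearest neighbours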


I still haven't figured out what a vector DB is, beyond something something AI.


VectorDBs let you retrieve documents that have textual similarity.

They allow you to sort results by the cosine similarity[1] between vectors. The idea is that you attach a vector to each document in the database; then you pass a vector in the query and get back the documents that best match the query vector.

The function that creates these vectors (string -> vector) is called an embedding, and it is constructed in such a way that "semantically similar" strings have vectors that are close together.
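
For intuition, cosine similarity itself is a few lines of numpy (a toy sketch with made-up 3-dim vectors; real embeddings have hundreds of dimensions):

    import numpy as np

    def cosine_similarity(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    doc = np.array([0.9, 0.1, 0.3])    # stand-in for an embedded document
    query = np.array([0.8, 0.2, 0.4])  # stand-in for an embedded query
    print(cosine_similarity(doc, query))  # near 1.0 => semantically similar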

It's not a very complicated idea, but complicated and powerful are orthogonal concepts.

They are useful in AI(LLMs) when you would like to include documents in your prompt that are relevant to your instruction. The best way to describe this is by example.

Imagine your query is "What is the capital of France?" Rather than requiring your LLM to have encountered this fact during its training, you can embed the question ("What is the capital of France?"), retrieve documents (say you've indexed all of Wikipedia in your vector db), and return some snippets from articles that include this information (context).

You then pass the prompt+context to an LLM, and given that it now has the relevant information, it can answer the question.

You can also imagine that it's much easier to update a vector db with new information than it is to retrain a model to ingest new facts.

1. https://en.wikipedia.org/wiki/Cosine_similarity


Imagine: you need a data structure which allows you to store vectors in memory and then say "here is a vector A, give me the 10 vectors closest to the direction of this vector A, in n-dimensional space".

One can imagine there is some optimal way to lay out the data in memory such that it would be relatively quick to do that, and a naive way which probably wouldn't be fast. A vector DB does the first thing - it lays out the data in a way which enables it to answer quickly.

And then it does all the other stuff a DB does - persisting to disk (which means the data needs to be laid out in a sensible way on disk too), handling multiple queries, updates, and the 50 other complicated things databases tend to do. Users generally want to do other operations on vectors as well, so a vector DB does those too.

For a small number of vectors you can build a vector DB yourself. Write a list of vectors to a file, load them into memory in no particular order, then for your "n closest" function just iterate through the list, calculating the difference in direction one by one and keeping the top n. Your simple system will work just fine for a toy demo.
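
Something like this (a numpy sketch; "vectors.npy" is a hypothetical file of stored vectors):

    import numpy as np

    vectors = np.load("vectors.npy")  # hypothetical stored vectors, shape (n, d)

    def closest(query, n=10):
        # Normalize, then a dot product gives cosine similarity
        # ("difference in direction").
        v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
        q = query / np.linalg.norm(query)
        return np.argsort(-(v @ q))[:n]  # indices of the n most similar vectors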


I think the better question is: what the hell is a vector in the context of normal business logic? Sure, a word-embedding vector makes sense because it's all an abstraction anyway, but if I have an "Employee" table with name, address, position, etc. columns, how does that translate into a vector?


Oh, it doesn't. Vectors in this context are used to semantically represent unstructured text, not the structured data you'll find in a table of a SQL database (except maybe a big fat text field).

Here:

https://chat.openai.com/share/9e557a90-e127-4654-9271-7c51fd...


I was hoping the article would say, and then when it didn't I was hoping the comments would, but so far no luck...


Wow, you are so right. I just typed "best vector database" into Google and got 4 ads, with the other results all talking about "vector database something AI".


Curious how this will work in practice, as vectors are specific to a given embedding model and could be domain-specific for better results. Could it lead to industry-standard embedding models, with regular (costly) upgrades?


I'm wondering the same thing. Standardization would be interesting but I wouldn't bet on it. Maintaining different vector columns for different models might work well?


Postgres and Mongo support it. ElasticSearch incorporates vector search. After exhaustive research, I have failed to identify an objective difference (beyond marketing nomenclature) between any pure-play vector solution and these.


Lots of focus on RAG here, and rightfully so, but I feel an overlooked benefit of vector databases is the novel visualizations they provide. Being able to plot qualitative data onto a 2D graph with t-SNE reduction gives you a new way to draw insights. I think many companies would benefit from such a visualization tool, especially those in qualitative research.
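
For anyone curious, the reduction itself is nearly a one-liner with scikit-learn (a sketch on random stand-in vectors):

    import numpy as np
    from sklearn.manifold import TSNE

    embeddings = np.random.rand(500, 768)  # stand-in for vectors pulled from the DB

    # Reduce to 2 dimensions for a scatter plot; perplexity is the main knob.
    coords = TSNE(n_components=2, perplexity=30).fit_transform(embeddings)
    print(coords.shape)  # (500, 2)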


Unfortunately, that visualization doesn't really work with vector DBs. Vector DBs normally split their data into separate segments and build indexes on them separately; there is no one overarching index but rather many small ones that are searched in parallel. In addition, such a heavy compression down to 2d/3d ends up producing a giant blob without much information.


Maybe I'm misunderstanding what exactly a vector db is.

Let's say you have a chatbot, and stored in its database is the usual info like session id, timestamp, message, etc. To "vectorize" this db would then mean vectorizing all the messages. Is this too simple an understanding?

Once the db has been vectorized, we can do semantic search on the messages and create more informative graphs based on the semantic similarity of messages within a given timeframe or other criteria.


FYI, a list of vector databases (including announcements). I try to keep the list as up-to-date as I can: https://medium.com/google-cloud/vector-databases-are-all-the....


The interest in vector databases arose as an external, specialized service that runs in addition to the "single source of truth" data lake where the data actually resides. Like Redis and Memcached before it, it solves a specific problem. Redis started to act like a fully fledged DB (with weird persistence methods and guarantees) only after it was really widespread. Sure, every DB will support vectors and every vector search engine will act like a DB. But that's missing the point: vector search is an expensive problem, with tradeoffs that justify a specialized design.

Btw, I'm working at a DB startup - https://hyper-space.io/


We recently added vector support to FeatureBase: https://cloud.featurebase.com/. What's interesting about using it with our existing b-tree storage layer is that I can now do set operations on the record's stringset fields before I do the similarity calculations on the vectors. I've been experimenting with laying out the vector space in a b-tree structure using this technique along with keyterm extraction, allowing for a type of semantic graph to be applied by the LLM to the vector space.


Go ahead and try to load a billion embeddings into Mongo or Elastic. The author says this will be "faster, cheaper, and simpler..." Will it?

(Disclaimer: I work for Pinecone, so obvious bias ahead but also perspective of 3 years since launching the Vector DB category and actually seeing billion-scale vector search deployments.)

> Basically, having separate vector DBs can add to cost and complexity. Imagine you were a MongoDB shop, with over 500m documents stored cross-region. If you are using a separate vector DB, say Pinecone, that may require moving potentially billions of embeddings between two databases, cross regions. This costs a lot, not to mention complex, since you are responsible for generating the embeddings... It’s faster, cheaper, and simpler if one database (Mongo, Elastic) just supported vector search.

If you want, say, 100ms search latency on just 100M vector embeddings in Elastic that'll already cost you $12,600 per month at minimum. And if you regularly write new or updated data to the index then your latencies will creep up until eventually you have to run a "force merge" which will grind your vector search to a halt for several hours (so much for easy and simple). I don't know how much it is on Mongo but given that it's bolting on the same vector index I would guess it's in the same ballpark. The cost grows sublinearly with more embeddings. (Pinecone is around 60% less than that, and will be even less soon.) The suggestion that having "billions" of embeddings in a traditional DB is easier and less costly shows you exactly why you should run your own tests and see for yourself.

When traditional database companies bolt-on vector indexing libraries such as HNSW[0] on top of their existing architecture, it's to meet demand from their existing users that have a relatively basic need for vector search.

For very basic and small-scale use cases, like <10M vectors with a relaxed data freshness requirement, you should just use whatever is the most convenient. Sometimes that's Pinecone, and sometimes that's the database you already have. (And if your current DB doesn't offer basic vector search, just wait two days).

When it comes to larger scale, like 100M+ vectors, if you want any hope of meeting performance, cost, and data freshness requirements then you should look at a purpose-built vector database. As GenAI workloads start to enter production and scale, a lot of people will find this out the hard way.

This has been true for every unique data structure and querying pattern for the past 40 years and it’s true for vector embeddings and vector-based retrieval. You can't blame the proliferation of different database types on hype and VC funding alone.

But don't take my word for it either. Go and run some tests that resemble your production workloads, then do what makes sense for your use case!

[0] https://www.pinecone.io/learn/series/faiss/hnsw/


I am curious: do you see a lot of transactional / high-velocity updates to vector embeddings in an underlying operational database system? I guess we'll see more of that in the future and we're still at the beginning? I ask because you describe regularly writing the data and the latency to merge/keep up, BUT one could argue that using a completely distinct system puts the index even further from the source of truth and hence makes it inherently harder to keep up.


Yes, for example any site with user-uploaded content — think marketplaces, social media, SaaS file/docs storage, etc. Very high write throughput and expectation of that data being available in search/chatbots right away.


But in a sense, doesn't that lend credence to the idea that the closer the indexing happens to where the data is born, the more natural the advantages may be?


The “force merge” process in an Elastic vector index took as long as 18 hours in one test. That’s a lot longer than a few hops across the network.

This is not even a dig at Elastic. The problem is deeper than that… It’s an issue with the underlying vector index they (and many others) chose to bolt on, HNSW, which was not designed with frequent live updates in mind.

We have a post coming soon that covers the technical parts of this in more detail. You asked a good question.


Got it (now you've got me going down the rabbit hole reading things like https://stackoverflow.com/questions/60226215/why-segment-mer...). I think the key question is whether a healthy index in operation, with a healthy amount of resources and configuration, should ever require a force merge, or whether that's effectively an anti-pattern which suggests you're already up a cul-de-sac of risk. From there it sounds like if you have any segments larger than 5 GB in particular, you may find yourself stuck unless you're an expert.

I'd love to see someone with expert knowledge of Elastic chime in on whether the characterization here seems right. But admittedly not everyone's going to be a power user, so if this isn't easy there's definitely a problem.

It definitely makes sense that the faster the rate of change, the more the engine has to combine results from multiple places until optimal placement on disk is found in steady state later, and hence query latency would rise.


Your argument is reasonable; however, you haven't explained why a purpose-built vector database is more scalable than any other DB with vector search support. Why is Elastic much slower than Pinecone? Is it because Pinecone is optimized in ways that Elastic cannot be? Is it because the Pinecone team has a much better understanding of how to optimize vector search algorithms? Or is it something else?


Blog post coming soon about this. The TL;DR is the underlying architecture matters… a lot. More coming soon.


The reason everyone's rushing to build vector databases is because they've tried to store vector data in a scalar database previously and realized they're hot garbage for the workload.

You should probably try it too before blogging about it.


That's not been my experience with distributed technologies. Are you referring to postgres on an oversaturated system?


If vector databases aren't a category, how can every database fall into that category sooner or later?


Larry?


[flagged]


> X are overrated!

That statement can definitely be true for many technologies, either because of a lot of hype and promises surrounding them (e.g. the history of MongoDB and how the NoSQL movement was like initially), or due to the mistaken belief that you need them from the very beginning of development even in cases when they are a good fit. I guess the first part is basically describing the Gartner hype cycle, which feels vaguely truthful: https://en.wikipedia.org/wiki/Gartner_hype_cycle Even good tech is susceptible to this, until people actually figure out what it's best used for and when.

If you're in the early stage of developing a prototype, or are working on a system for a small business, then a single RDBMS can indeed work for most use cases - even when you need to store JSON, use full-text indices or even store a reasonable amount of binary data (or even a NoSQL solution, depending on the constraints). It's the same with how you can build an entire business on a monolith written in boring tech, like .NET or Java, with whatever you want for the front end.

Eventually, you might be well served to branch out - at the point where the need for something more specialized becomes more pressing and you can actually afford the time and human resources to manage the complexity/integration effort and so on, or even hire people proficient in that particular tech in the first place. If you have an entire team just for search, then you're probably at that point. If it's just another ticket in your issue tracker for a developer or two to implement, just use the simple index approach with hopefully good enough results.

If it's a greenfield project and you're doing something very novel, all bets are probably off, though.


Personally, I find such snarky comments a really bad look from someone who is apparently associated with a product (judging from comment history).


> 𝐌𝐞𝐦𝐨𝐫𝐲 𝐢𝐬 𝐨𝐯𝐞𝐫𝐫𝐚𝐭𝐞𝐝! Forget everything. Do not fill your brain with new information. Do not learn new stuff! Just relax and enjoy life!

Okay, but like... as someone with a dissociative disorder this is scarily accurate

forget everything! Wake up one day and don't remember what you did the previous day. It's fine. Constantly run into friends you don't remember making. It's fine. Just be happy. Just be happy Just be happy Just be ha


Super lazy comment. SQL databases got JSON handling. NoSQL databases added indexes and ACID compliance.

An entirely new class of databases that operates identically to existing ones except for a single column type is silly; it's only getting traction because of aggressive VC-funded marketing.


Would you say the same about keyword search engines like Elastic, Solr, etc.? It is just another column type, a full-text index, available in any proper database. Just hype...


Look up what Lucene is and how it differs from what an RDBMS offers for full-text search and you'll be able to answer that question for yourself.

I had a feeling the heavily overdone sarcasm in the original comment was a way to mask a lack of understanding of the subject, and here you confirmed it.


And maybe call out the fact that you're the founder of a vector database SaaS? That definitely explains the original diatribe...



