How large is the data (how many vectors, and how many dimensions)? What type of queries do you need (N nearest neighbors to a target vector by L2 distance, or something else)? Where do the queries come from (a recommendation system serving user requests, or internal requests from an ML system)? What are the throughput and latency requirements (how many queries per second it should serve, and how quickly it should answer)?
ClickHouse already works well for vector search.
For example, if you have one million vectors of 1024 dimensions and you search for the nearest vectors by brute force, the query takes around 150 ms, which is good enough for a recommendation-system scenario in e-commerce, food-tech, and similar applications.
Example:
CREATE TABLE vectors (id UInt64, vector Array(Float32)) ENGINE = Memory;
SET max_block_size = 16; -- 16 rows × 4 KB per row = 64 KB per block
INSERT INTO vectors SELECT number, arrayMap(x -> randNormal(0.0, 1.0, x), range(1024)) FROM numbers_mt(1000000); -- 4 GiB
WITH (SELECT vector FROM vectors LIMIT 1) AS target
SELECT count() FROM vectors WHERE NOT ignore(L2SquaredDistance(vector, target)); -- 0.113 sec.
SELECT count() FROM vectors WHERE NOT ignore(L2Norm(vector)); -- 0.110 sec.
WITH (SELECT vector FROM vectors LIMIT 1) AS target
SELECT count() FROM vectors WHERE NOT ignore(arraySum((x, y) -> x * y, vector, target)); -- 0.150 sec.
WITH (SELECT vector FROM vectors LIMIT 1) AS target
SELECT id, L2SquaredDistance(vector, target) AS distance FROM vectors ORDER BY distance LIMIT 10; -- 0.144 sec.
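For reference, here is a minimal NumPy sketch of the same brute-force top-10 search that the last query performs. The sizes are smaller than the 1M × 1024 example so it runs quickly; the shapes, the seed, and the use of row index as id are assumptions for illustration only.

```python
import numpy as np

# Smaller stand-in for the ClickHouse table: 10,000 vectors of 128 dims,
# with the row index playing the role of the id column.
rng = np.random.default_rng(42)
vectors = rng.standard_normal((10_000, 128)).astype(np.float32)
target = vectors[0]

# Squared L2 distance of every vector to the target,
# the same metric as L2SquaredDistance in the SQL query.
distances = ((vectors - target) ** 2).sum(axis=1)

# Top-10 nearest ids: equivalent to ORDER BY distance LIMIT 10.
top10 = np.argsort(distances)[:10]
print(top10[0])  # the target itself, at distance 0
```

The SQL query does the same work, but streams over the table block by block instead of materializing one large matrix, which is why brute force over 4 GiB of in-memory data still finishes in about 150 ms.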