
How large is the data (the number of vectors and their dimensions)? What type of queries (N nearest neighbors to a target vector by L2 distance, or something else)? From where are the queries sent (a recommendation system serving user requests; internal requests from an ML system)? What are the throughput and latency requirements (how many queries per second it should serve, and how quickly it should answer)?

ClickHouse already works well for vector search.

For example, if you have one million vectors of 1024 dimensions each, and you search for the nearest vectors to a target vector by brute-force scan, the query takes about 150 ms, which is good enough for a recommendation-system scenario in e-commerce, food tech, and similar applications.

Example:

    CREATE TABLE vectors (id UInt64, vector Array(Float32)) ENGINE = Memory;
    SET max_block_size = 16; -- each row is 1024 * 4 B = 4 KB, so 16 rows ≈ 64 KB per block
    INSERT INTO vectors SELECT number, arrayMap(x -> randNormal(0.0, 1.0, x), range(1024)) FROM numbers_mt(1000000); -- 4 GiB

    WITH (SELECT vector FROM vectors LIMIT 1) AS target
    SELECT count() FROM vectors WHERE NOT ignore(L2SquaredDistance(vector, target)); -- full scan, squared L2 distance: 0.113 sec

    SELECT count() FROM vectors WHERE NOT ignore(L2Norm(vector)); -- baseline, norm only: 0.110 sec

    WITH (SELECT vector FROM vectors LIMIT 1) AS target
    SELECT count() FROM vectors WHERE NOT ignore(arraySum((x, y) -> x * y, vector, target)); -- dot product: 0.150 sec

    WITH (SELECT vector FROM vectors LIMIT 1) AS target
    SELECT id, L2SquaredDistance(vector, target) AS distance FROM vectors ORDER BY distance LIMIT 10; -- top 10 nearest: 0.144 sec
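The ~150 ms timings above are consistent with a simple memory-bandwidth estimate. A back-of-envelope sketch (the ~30 GiB/s effective read bandwidth is an assumption for illustration; real hardware varies):

```python
# Back-of-envelope: time to scan 1M x 1024-dim Float32 vectors from memory.
rows = 1_000_000
dims = 1024
bytes_per_value = 4  # Float32

data_bytes = rows * dims * bytes_per_value
data_gib = data_bytes / 2**30  # ~3.81 GiB, i.e. roughly the "4 GiB" above

bandwidth_gib_s = 30  # assumption: effective multi-core read bandwidth
scan_time_ms = data_gib / bandwidth_gib_s * 1000

print(f"{data_gib:.2f} GiB scanned, ~{scan_time_ms:.0f} ms at {bandwidth_gib_s} GiB/s")
```

At that bandwidth, a full scan alone costs on the order of 130 ms, so the measured 110-150 ms queries are close to memory-bound; this is why brute force is viable at this scale without an approximate index.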

