The most cost-effective Vector Databases

binarymax · on May 17, 2023

Hey there - interesting solution!

A note on benchmarks with HNSW is that you need to optimize your ‘m’ and ‘ef_construction’ params at index time and your ef_search param at query time.

So you may not be getting an optimized result from DBs like Qdrant and Weaviate.

lqhl · on May 18, 2023

That's true, and we have tried different configurations for systems that use HNSW. For ease of presentation, though, we only chose the configuration with the highest throughput at 98% precision.

Here is a figure in our open-source benchmark framework repo that shows other configurations that we have tested: https://github.com/myscale/vector-db-benchmark/blob/master/i...
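The selection rule described above (keep only runs that reach 98% precision, then take the one with the highest throughput) can be sketched as follows. This is a minimal illustration and not code from the benchmark framework: the `RunResult` type, the `best_config` helper, and every number below are hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class RunResult:
    """One benchmark run for a single HNSW configuration (hypothetical type)."""
    m: int
    ef_construction: int
    ef_search: int
    precision: float  # fraction in [0, 1]
    qps: float


def best_config(results: Iterable[RunResult],
                min_precision: float = 0.98) -> Optional[RunResult]:
    """Return the highest-QPS run whose precision meets the threshold, or None."""
    eligible = [r for r in results if r.precision >= min_precision]
    return max(eligible, key=lambda r: r.qps, default=None)


# Made-up numbers for illustration only; not measurements from any system.
runs = [
    RunResult(m=16, ef_construction=128, ef_search=64, precision=0.962, qps=140.0),
    RunResult(m=32, ef_construction=256, ef_search=128, precision=0.981, qps=90.0),
    RunResult(m=32, ef_construction=256, ef_search=256, precision=0.993, qps=60.0),
]

chosen = best_config(runs)
print(chosen)
```

Note that the fastest run is rejected here because it misses the precision threshold, which is why index-time and query-time parameters have to be swept together.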

renaissancist · on May 17, 2023

We actually tuned `m` (e.g. 32), `ef_construct` and `ef_search` (e.g. 128-256) to fairly large values for best performance, but were not able to obtain better results. Overall, HNSW consumes a lot of resources, so it's not as cost-effective.

LukeAI · on May 17, 2023

Looks interesting! A high-performance vector database service with full SQL support, built on top of ClickHouse (famous for its speed as well), that is also cost-effective to use.

siyud · on May 17, 2023

It's really exciting to see a SQL based vector database with such high performance. I can't wait to try it.

kacperlukawski · on May 17, 2023

Did you run the clients in the same regions as the servers? That may impact the results.

lqhl · on May 18, 2023

Yes. For MyScale (AWS us-east-1), Pinecone (AWS us-east-1), Qdrant (AWS us-east-1), and Zilliz Cloud (AWS us-east-2), we ran the clients in the same region as the servers. For Weaviate, the server is in GCP US east, while the client is in AWS us-east-1. Since its throughput is around 66 QPS, the impact of networking should be low.

renaissancist · on May 17, 2023

Yes, we did. If the QPS is high, running the clients in the same region is important for performance.

lqhl · on May 17, 2023

Over the past few months, we have been working on an exciting project to bridge the gap between high-performance vector search and OLAP databases. Today, we are thrilled to announce the release of our new end-to-end benchmark of MyScale, which includes a comparison with some of the state-of-the-art vector databases for your reference. Here are some key takeaways that might pique your interest:

1. [Low Cost] Here we measure the ratio of the service's monthly cost to its QPS (Queries Per Second), normalized to 100 QPS. In other words, it quantifies the monthly cost required to achieve 100 QPS on 5 million vector data points. Our analysis highlights the superior cost-performance ratio of MyScale, which is over 3.6 times cheaper than other vector databases. https://blog.myscale.com/2023/05/17/myscale-outperform-speci...

2. [High Throughput] MyScale outperforms other vector databases in terms of QPS on the LAION 5M dataset with a 98.5% recall rate, achieving over 150 QPS. In comparison, Pinecone s1 has a QPS of approximately 10, which is significantly lower than MyScale. Weaviate and Zilliz Cloud both achieve around 65 QPS, while Qdrant achieves 81 QPS. https://blog.myscale.com/2023/05/17/myscale-outperform-speci...

3. [Quick Response] Query latency is an important performance metric that is measured from the time the client sends the request until it receives the response. MyScale achieves 150 QPS while maintaining an average latency as low as 25.8 ms. Pinecone s1 has a relatively high latency of over 400 ms. Weaviate and Zilliz Cloud both have latencies of around 60 ms, while Qdrant has a slightly higher latency of around 100 ms. https://blog.myscale.com/2023/05/17/myscale-outperform-speci...

4. [Fast Data Ingestion] The time it takes from data upload to the vector index being built and ready to serve is referred to as data ingestion time. Index creation can take a long time, especially for graph-based algorithms such as HNSW. Among all the services tested, MyScale had the fastest ingestion time for 5 million data points, completing the task in about 30 minutes. Pinecone s1 takes approximately 53 minutes, while Weaviate takes 72 minutes. Zilliz Cloud requires a longer duration of approximately 113 minutes, while Qdrant has the longest ingestion time, taking 145 minutes to process 5 million data points. https://blog.myscale.com/2023/05/17/myscale-outperform-speci...
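The cost metric in item 1 is a simple normalization: monthly cost divided by QPS, multiplied by 100. A minimal sketch follows. The function name is hypothetical, the monthly price is a made-up placeholder rather than any vendor's actual pricing, and the calculation assumes cost scales linearly with throughput.

```python
def monthly_cost_per_100_qps(monthly_cost_usd: float, qps: float) -> float:
    """Monthly cost needed to sustain 100 QPS, assuming cost scales linearly with QPS."""
    if qps <= 0:
        raise ValueError("qps must be positive")
    return monthly_cost_usd / qps * 100.0


# 150 QPS is the approximate figure quoted above; the $120/month price
# is a made-up placeholder, not real pricing.
print(monthly_cost_per_100_qps(monthly_cost_usd=120.0, qps=150.0))
```

A lower value is better: a service with the same monthly price but half the throughput would score twice as high on this metric.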

Also, here are some other nice features MyScale offers:

1. Simple data import and backup: We support common formats like Parquet, tar, and CSV, from or to S3 buckets or other object storage systems.

2. Besides MSTG, the algorithm we proposed and tested in this benchmark, there are more options from FAISS and HNSW. You may choose the one you are familiar with.

3. Built on ClickHouse and part of its community. Boost your vector search with ClickHouse's advanced features.

MyScale is currently in beta, with a free developer tier and a commercial plan on the way. To the best of our knowledge, MyScale offers the first free plan that supports 5 million 768-dimensional vector data points with high-performance search.

zhangjmruc · on May 18, 2023

Great job!

Windfall_ · on May 18, 2023

Great job! I can't wait to try it.