90x Faster Than Pgvector – Lantern's HNSW Index Creation Time (lantern.dev)
109 points by diqi 9 months ago | 36 comments



Nice to see people care about index construction time.

I'm the lead author of JVector, which scales linearly to at least 32 cores and may be the only graph-based vector index designed around nonblocking data structures (as opposed to using locks for thread safety): https://github.com/jbellis/jvector/

JVector looks to be about 2x as fast at indexing as Lantern, ingesting the Sift1M dataset in under 25s on a 32-core AWS box (m6i.16xl), compared to 50s for Lantern in the article.

(JVector is based on DiskANN, not HNSW, but the configuration parameters are similar -- both are configured with graph degree and search width.)


Sift1M is too small to make meaningful comparisons. Storing 96 floats * 1M vectors only takes up 800 MB of memory.


This reads like a marketing piece, not an honest technical blogpost.

I agree that USearch is fast, but it feels pretty dishonest to take credit for someone else's work. Maybe at least honestly profile what's going on with USearch vs pgvector (and which settings for pgvector?), and write something interesting about it?

The last time I tried Lantern, it'd segfault when I tried to do anything non-trivial with it, and it was incredibly unsafe with how it handled memory. Hopefully that's at least fixed, but Lantern has so many red flags.


USearch author here :)

Not sure if it's fair to compare USearch and pgvector. One is an efficient indexing structure, the other is more like a pure database plugin. Not that they can't be used in a similar fashion.

If you are looking for pure indexing benchmarks, you might be interested in our USearch vs FAISS HNSW comparison [1]. We run them ourselves (as do a couple of other tech companies), so take them with a grain of salt; they might be biased.

As for Lantern vs pgvector: impressed by the result! A lot of people would benefit from having fast vector search compatible with Postgres. The way to go!

It wasn't a trivial integration by any means, and the Lantern team was very active, suggesting patches to the upstream version to ease integrations with other databases. Some of those are tricky and have yet to be merged [2]. So stay tuned for USearch v3. Lots of new features coming :)

[1]: https://www.unum.cloud/blog/2023-11-07-scaling-vector-search...

[2]: https://github.com/unum-cloud/usearch/pull/171/files


Hi, sorry you didn't have a good experience with Lantern before. We first posted on HN about 3 months ago; things should be better now. Please let us know if you have any issues.


Using ivfflat is much faster for bulk index creation than Lantern. There are a lot of trade-offs depending on the specific use case, but it seems like a pretty massive thing to leave out.

```
postgres=# CREATE INDEX ON sift USING ivfflat (v vector_l2_ops) WITH (lists=1000);
CREATE INDEX
Time: 65697.411 ms (01:05.697)
```
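
Recall for ivfflat is then governed at query time by pgvector's ivfflat.probes setting; for example:

```
-- pgvector's ivfflat GUC: more probes means better recall but slower queries.
SET ivfflat.probes = 10;
```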


Cofounder here.

You are right that there are many trade-offs between HNSW and IVFFLAT.

E.g., IVFFLAT requires that there be a significant amount of data in the table before the index is created, and assumes the data distribution does not change with additional inserts (since it chooses centroids during initial creation and never updates them).

We have also generally had a harder time getting high recall with IVFFLAT on vectors from embedding models such as ada-002.

There are trade-offs, some of which we will explore in later blog posts.

This post is about one thing: HNSW index creation time across two systems, at a fixed 99% recall.
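
For concreteness, index creation on each side is a single statement; a minimal sketch (table and column names are hypothetical, and the parameters shown are illustrative defaults rather than our exact benchmark settings):

```
-- pgvector HNSW (0.5.x syntax); m and ef_construction shown are pgvector's defaults.
CREATE INDEX ON sift USING hnsw (v vector_l2_ops) WITH (m = 16, ef_construction = 64);

-- Lantern's equivalent, assuming the operator class name from Lantern's README.
CREATE INDEX ON sift USING hnsw (v dist_l2sq_ops) WITH (M = 16, ef_construction = 64);
```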


External index creation also requires that a significant amount of data be in the table for it to be worth it, along with all the other potential issues.


As someone who just indexed 6m documents with pgvector, I can say it's a massive time sink: on the order of days, even with a 32-core, 64 GB RDS instance.


What were the token sizes for comparison?


I've done a few: 384, 762, 512. All take a few days.

Though index creation is not a big deal, I want good queries rapidly for cheap. So IMO RDS with pgvector is the easiest approach.


That sounds much longer than it should. I am not sure of your exact use case, but I would encourage you to check out Marqo (https://github.com/marqo-ai/marqo - disclaimer, I am a co-founder). All inference and orchestration is included (no API calls), and many open-source or fine-tuned models can be used.


> That [pgvector index creation time] sounds much longer than it should... I would encourage you to check out Marqo

Your comment makes it sound like Marqo is a way to speed up pgvector indexing, but to be clear, Marqo is just another vector database and is unrelated to pgvector.


Fair enough, apologies for the confusion!


The reason I would use pgvector is because I am uninterested in another piece of infrastructure.


How does performance scale (vs pgvector) when you have an index and start loading data in parallel? Or how does this scale vs the to-be-released pgvector 0.5.2?


I'm also concerned about these (tested!) errors:

> https://github.com/lanterndata/lantern/blob/040f24253e5a2651...

> Operator <-> can only be used inside of an index

Isn't the use of the distance operator in scan+sort critical for generating the expected/correct result that's needed for validating the recall of an ANN-only index?


Ah, thank you for noticing! We have a typo in the error message: it should be the <?> operator instead of <->.

There's some context on the operator <?> here: https://github.com/lanterndata/lantern?tab=readme-ov-file#a-...


We haven't benchmarked against 0.5.2 yet so I can't share exact numbers. We will benchmark it once it is released.

We think our approach will still significantly outperform pgvector because it does less on your production database.

We generate the index remotely, on a compute-optimized machine, and only use your production database for index copy.

Parallel pgvector would have to use your production database resources to run the compute-intensive HNSW index creation workload.


Which version of pgvector are you using for these benchmarks?


We used 0.5.0 for these benchmarks.


It’s not really a fair comparison in that case:

https://x.com/pgvector/status/1711910075416432785?s=46

Do you have the code you used so that we can reproduce these results?


I added an edit note at the bottom of the blog post.

The original post and the experiments were created before pgvector 0.5.1 was out, and we had not realized there was significant work to optimize index creation time in the latest pgvector release.

We reran the benchmarks with pgvector 0.5.1. pgvector index creation is now on par with, or up to 10% faster than, Lantern on a single core. Lantern still allows 30x faster index creation by leveraging additional cores.

Wiki:
- pgvector: 36m
- Lantern: 43m
- Lantern external indexing (32 CPU): 2m 15s

Sift:
- pgvector: 12m 30s
- Lantern: 7m
- Lantern external indexing (32 CPU): 25s

The DB parameters for the above results (both Lantern and pgvector): shared_buffers=12GB, maintenance_work_mem=5GB, work_mem=2GB.

The DB parameters for the previous results were the defaults for both Lantern and pgvector.

Benchmarking was done using psql's \timing on a 32 CPU / 64 GB RAM machine (Linode Dedicated 64).
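
To reproduce, those settings can be applied like this (a sketch; shared_buffers only takes effect after a server restart):

```
ALTER SYSTEM SET shared_buffers = '12GB';        -- requires a restart
ALTER SYSTEM SET maintenance_work_mem = '5GB';
ALTER SYSTEM SET work_mem = '2GB';
-- Timings were then taken with psql's \timing around each CREATE INDEX.
```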

Feel free to reach out if you need anything for benchmarks.


> Feel free to reach out if you need anything for benchmarks.

Likewise, feel free to reach out before publishing pgvector benchmarks. I'm sure we will have some tips to make them more impartial.


Curious about the "outside of the database" index generation part. Is this index WAL-protected eventually?


Yes, it is WAL-protected. The advantage of external indexing is that the HNSW graph is constructed externally on multiple cores instead of on a single core inside the Postgres process. But eventually the graph is parsed and processed inside Postgres, with all the necessary WAL records written for its blocks.



You piqued my interest enough to sign up and try... but now it needs an Access Code to try the DB. Any HN special here?


Try YCW24! :)


So approximately 0% chance I could use this on AWS RDS or Aurora, correct?

Still, very impressive


This extension is licensed under the Business Source License[0], which makes it incompatible with most DBaaS offerings. The BSL is a source-available (not open-source) license: a good choice for Lantern, but it keeps managed database providers from offering the extension.

Some Postgres offerings allow you to bring your own extensions to work around the limitations of these restrictive licenses, for instance Neon[1], where I work. I tried to look at the AWS docs for you, but couldn't find anything about that. I did find Trusted Language Extensions[2], but that seems to be more about writing your own extensions; I couldn't find a way to upload arbitrary ones.

I will add that you could use logical replication[3] to mirror data from your primary database into a Lantern-hosted database (or host your own database with the Lantern extension). This obviously has a couple of downsides, but I thought I would mention it (see the sketch after the links below).

[0]: https://github.com/lanterndata/lantern/commit/dda7f064ca80af...

[1]: https://neon.tech/docs/extensions/pg-extensions#custom-built...

[2]: https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Postg...

[3]: https://www.postgresql.org/docs/current/logical-replication....
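
A minimal sketch of that replication setup (all object names and connection details hypothetical):

```
-- On the primary database: publish the table(s) to mirror.
CREATE PUBLICATION lantern_pub FOR TABLE documents;

-- On the Lantern-side database (the table schema must already exist there):
CREATE SUBSCRIPTION lantern_sub
    CONNECTION 'host=primary.example.com dbname=app user=replicator'
    PUBLICATION lantern_pub;
```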


Likely as an extension eventually. I know RDS has a variety of Postgres extensions you can use. pgvector is supported, so Lantern could likely get support as well.

[1] https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Postg...


Unless AWS is going to buy a license, which is probably going to cost them a ton of money, I don't see them offering Lantern given it's licensed under the Business Source License, which is nasty for cloud providers.

AWS is also a big proponent of pgvector, so it is more likely they would put more money into that, in my non-expert opinion.


Adding to this: the article the GP comment linked to is about Trusted Language Extensions, which would allow AWS users to run extensions that aren't "officially" installed on RDS.

However, since Lantern is written in C++, it doesn't qualify: C++ is not a trusted language for Postgres. If it were instead written in Rust (for example), it could theoretically be used with RDS, since Rust is now a trusted language on RDS: https://aws.amazon.com/about-aws/whats-new/2023/05/amazon-rd...


My understanding is that Trusted Language Extensions refer to extensions written in PL/Rust, a Postgres extension mechanism for writing user-defined functions and using them in SQL queries.

PL/Rust is a more performant and more feature-rich alternative to PL/pgSQL, which is the traditional UDF scripting language for Postgres.

Building a vector index (or any index for that matter) inside Postgres is a more involved process and cannot be done via the UDF interface, be it Rust, C, or PL/pgSQL.

So I think even if Lantern were written in Rust, it would not be a Trusted Language Extension under this definition.
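
To make the distinction concrete, here is the kind of thing the UDF interface does cover (a rough sketch of PL/Rust's function shape; the function itself is hypothetical). Defining a new index access method is a much deeper extension point than this:

```
-- A scalar UDF is squarely within PL/Rust's remit; by default arguments
-- arrive as Option<T>, hence the unwrap_or defaults (hypothetical example).
CREATE FUNCTION squared_diff(a REAL, b REAL) RETURNS REAL
LANGUAGE plrust AS
$$
    let d = a.unwrap_or(0.0) - b.unwrap_or(0.0);
    Ok(Some(d * d))
$$;
```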


(Disclosure: I work at Supabase and have been developing TLEs with the RDS team.)

Trusted Language Extensions refer to extensions written in any trusted language: in this case Rust, but that also includes plpgsql, plv8, etc. See [0].

> PL/Rust is a more performant and more feature-rich alternative to PL/pgSQL

This is only partially true. plpgsql has bindings to low-level Postgres APIs, so in some cases it is just as fast as (or faster than) Rust.

> Building a vector index (or any index for that matter) inside Postgres is a more involved process and cannot be done via the UDF interface, be it Rust, C, or PL/pgSQL

Most PG Rust extensions are written with the excellent pgrx framework [1]. While it doesn't have index bindings right now, I can certainly imagine a future where this is possible[2].

All that said, I think there are a lot of hoops to jump through right now, and I doubt it's worth it for your team. I think you are right to focus on developing a separate C extension.

[0] TLE: https://supabase.com/blog/pg-tle

[1] pgrx: https://github.com/pgcentralfoundation/pgrx

[2] https://github.com/pgcentralfoundation/pgrx/issues/190#issue...



