Show HN: Embeddinghub: A vector database built for Machine Learning embeddings (github.com/featureform)
118 points by cyrusthegreat 37 days ago | 33 comments



Hi everyone!

Over the years, I've found myself building hacky solutions to serve and manage my embeddings. I’m excited to share Embeddinghub, an open-source vector database for ML embeddings. It is built with four goals in mind:

Store embeddings durably and with high availability

Allow for approximate nearest neighbor operations

Enable other operations like partitioning, sub-indices, and averaging

Manage versioning, access control, and rollbacks painlessly
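
Of the operations in the list above, averaging is the easiest to illustrate: a common pattern is to build one embedding for a user by averaging the embeddings of the items they interacted with. A minimal sketch in plain Python follows; `average_embeddings` and the sample vectors are hypothetical and are not part of Embeddinghub's API.

```python
def average_embeddings(vectors):
    """Element-wise mean of equal-length vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    dims = len(vectors[0])
    if any(len(v) != dims for v in vectors):
        raise ValueError("all vectors must have the same dimensionality")
    # zip(*vectors) walks the vectors column by column.
    return [sum(column) / len(vectors) for column in zip(*vectors)]

# Hypothetical item embeddings for things one user interacted with.
viewed_items = [[1.0, 0.0, 2.0], [3.0, 2.0, 0.0]]
user_embedding = average_embeddings(viewed_items)
print(user_embedding)
```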

It's still in the early stages, and before we committed more dev time to it we wanted to get your feedback. Let us know what you think and what you'd like to see!

Repo: https://github.com/featureform/embeddinghub

Docs: https://docs.featureform.com/

What's an Embedding? The Definitive Guide to Embeddings: https://www.featureform.com/post/the-definitive-guide-to-emb...


In the "Definitive Guide to Embeddings", in the figure "An illustration of One Hot Encoding", the "One Hot Encoding" table doesn't make any sense whatsoever. Am I wrong?


no you're right ahahah wth are these


You are both right. I just realized this and would be embarrassed if I wasn’t laughing so hard. I gave an original drawing to our designer with the correct values and we didn’t inspect their final image. We’ll get this fixed, thanks for pointing this out and sorry for the confusion :)
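
For reference, a correct one-hot table has exactly one 1 per row, in the column that matches that row's category. A minimal Python sketch, using made-up category names rather than the ones from the guide's figure:

```python
def one_hot_encode(values):
    """Return (categories, rows); each row has a single 1 in its category's column."""
    categories = sorted(set(values))  # fixed, reproducible column order
    index = {cat: i for i, cat in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1
        rows.append(row)
    return categories, rows

# Hypothetical example data, not taken from the guide.
categories, rows = one_hot_encode(["cat", "dog", "bird", "dog"])
print(categories)
for row in rows:
    print(row)
```

Every row sums to 1, and two inputs with the same category get identical rows. That property is what the figure in question was missing.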


Holy shit, this looks amazing!

I see you've got examples for NLP use cases in your docs. Can't wait to read them. Embeddings are a constant source of complexity when I'm trying to move certain operations to Lambda, this looks like it would speed the initializations up big time.


We're so glad to hear that! We'd love your feedback as we keep building. Please join our community on Slack: https://join.slack.com/t/featureform-community/shared_invite...


Curious about how your solution is different / better than nmslib which I've tried in the past?


We actually use HNSWLIB by NMSLIB on the backend. NMSLIB is solving the approximate nearest neighbor problem, not the storage problem. It’s not a database, it’s an index. We handle everything needed to turn their index into a full-fledged database with a data science workflow around it (versioning, monitoring, etc.).


That's great. I've been very impressed by the performance of nmslib in my scenarios. I'll definitely check out eh - thanks for sharing!


Where can I find documentation on versioning? My first use case would be to version different embeddings and use this more as a storage backend than for KNN search. Would it be possible to skip building the NN graph and just use it for versioned storage? We currently use opendistro, which nicely allows pre- and post-filtering based on other document fields (other than the embedding). So I don't think this could ever be a full replacement without figuring out how to combine it with the rest of the document structure.


Hey! We're actually polishing up a PR that'll add documentation and finalize the versioning API, it should be merged in this weekend. Would you be up for a quick chat with someone on our team? It would be interesting to get your feedback and see what else we're missing to be a drop-in replacement to opendistro, join our slack if so. We'll dm you :) https://join.slack.com/t/featureform-community/shared_invite...
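
To make the question concrete, versioned storage without a nearest-neighbor index could look roughly like the sketch below. `VersionedEmbeddingStore` and its methods are hypothetical names used for illustration only. This is not Embeddinghub's API, which per the reply above was still being finalized.

```python
class VersionedEmbeddingStore:
    """Toy in-memory store: each version is a snapshot of key -> vector."""

    def __init__(self):
        self._versions = {}  # version name -> {key: vector}
        self._history = []   # order in which versions became current

    def commit(self, version, embeddings):
        if version in self._versions:
            raise ValueError(f"version {version!r} already exists")
        # Copy so later changes to the caller's data cannot alter the snapshot.
        self._versions[version] = {k: list(v) for k, v in embeddings.items()}
        self._history.append(version)

    @property
    def current(self):
        return self._history[-1] if self._history else None

    def get(self, key, version=None):
        version = version or self.current
        return self._versions[version][key]

    def rollback(self):
        """Make the previous version current again; the snapshot itself is kept."""
        if len(self._history) < 2:
            raise RuntimeError("no earlier version to roll back to")
        self._history.pop()
        return self.current


store = VersionedEmbeddingStore()
store.commit("v1", {"user_42": [0.1, 0.2]})
store.commit("v2", {"user_42": [0.3, 0.4]})
print(store.get("user_42"))  # served from v2
store.rollback()
print(store.get("user_42"))  # served from v1 again
```

No graph is built anywhere here; lookups are exact key reads, which is the "storage backend" use described in the question.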


How is this different from Pinecone, Milvus, and Faiss?


Pinecone is closed source and only available as a SaaS service. We have more overlap with Milvus, but we’re focused on the embeddings workflow, like versioning and using embeddings with other features, while Milvus is entirely focused on nearest neighbor operations.

Faiss is solving the approximate nearest neighbor problem, not the storage problem. It’s not a database, it’s an index. We use HNSWLIB, a lighter-weight implementation of the HNSW algorithm that Faiss also supports, to index embeddings in Embeddinghub.


I'm from Pinecone so I can chime in...

The biggest difference, as cyrusthegreat pointed out, is that we're a fully managed service. You sign up, spin up a database service with a single API call[0], and go from there. There's no infrastructure to build and keep available, even as you scale to billions of items.

Pinecone also comes with features like metadata filtering[1] for better control over results, and hybrid storage for up to 10x lower compute costs. EmbeddingHub has a few features Pinecone doesn't yet have, like versioning -- though with our architecture it's straightforward to add if someone asks.

Hope that helps! And I'm glad to see more projects in this space, especially from the feature-store side.

[0] https://www.pinecone.io/docs/api/operation/create_index/

[1] https://www.youtube.com/watch?v=r5CsJ_S9_w4


Cool! Nice work! Do you have any performance numbers you could share?

Specifically: nearest neighbor computation latency, latency of a regular embedding get, and the read/write rate achieved on one machine?


Not yet, this is very much an early release to get it in people's hands and to get feedback on the API and the functionality. We've purposely held off optimizing too much until we feel more confident that this is useful and our API approach makes sense for people. That said, Simba who's one of the main devs actually comes from a performance tuning background at Google. Also, it's built on HNSWLIB and RocksDB, and is being used in real world workloads today.


This is really great! It speaks very much to my use-case (building user embeddings and serving them both to analysts + other ML models).

I was wondering if there was a reasonable way to store raw data next to the embeddings such that: 1. Analysts can run queries to filter down to a space they understand (the raw data). 2. Nearest neighbors can be run on top of their selection on the embedding space.

Our main use case is segmentation, so giving analysts access to the raw feature space is very important.


This is in the works! We'd love your feedback on the API and to learn a bit more about your use-case so we build the right thing, mind joining our slack? https://join.slack.com/t/featureform-community/shared_invite...
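
The two-step flow described in the question (filter on the raw data, then run nearest neighbors inside that selection) can be sketched in plain Python. This is a brute-force illustration with made-up records and field names. It is not how Embeddinghub implements it, since that feature was still in progress.

```python
import math

def filtered_nearest_neighbors(records, query, predicate, k):
    """Pre-filter records on raw fields, then rank the survivors by Euclidean distance."""
    candidates = [r for r in records if predicate(r["raw"])]
    candidates.sort(key=lambda r: math.dist(r["embedding"], query))
    return [r["key"] for r in candidates[:k]]

# Hypothetical user records: raw features stored next to the embedding.
records = [
    {"key": "u1", "raw": {"country": "US", "age": 31}, "embedding": [0.0, 0.1]},
    {"key": "u2", "raw": {"country": "DE", "age": 45}, "embedding": [0.0, 0.0]},
    {"key": "u3", "raw": {"country": "US", "age": 52}, "embedding": [0.9, 0.9]},
    {"key": "u4", "raw": {"country": "US", "age": 28}, "embedding": [0.2, 0.1]},
]

def us_only(raw):
    return raw["country"] == "US"

print(filtered_nearest_neighbors(records, [0.0, 0.0], us_only, k=2))
```

Filtering first guarantees that up to k results come from the analyst's segment. Searching first and filtering afterwards can return fewer than k matches when the segment is small, which is why the order matters for segmentation.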


Nice, are there any benchmarks?

Would be interesting to see how it compares to Postgres or LevelDB for read/write of exact values

And how it compares to Faiss/Annoy for KNN


Great work! Looks like you are using HNSWLIB. From what I understand, the HNSW graph-based approach can be memory intensive compared to a PQ-code-based approach. FAISS has support for both HNSW and PQ codes. Any plans to extend your work to support a PQ-code-based index in the future?


Yes! We plan to bring Faiss in and use a lot of its functionality. Our goal for this release was to get an end-to-end version working so we could get feedback on the API, and HNSW was a good default with that in mind.
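
A back-of-the-envelope comparison shows why the memory question matters. The formulas below follow the rough per-vector guidelines from the Faiss wiki: an HNSW index over raw float32 vectors needs about 4*d bytes for the vector plus roughly 2*M four-byte graph links, while PQ with 8-bit codes needs m bytes. The dataset size and parameter values are assumptions picked for illustration.

```python
def hnsw_bytes_per_vector(d, M):
    """Raw float32 vector (4 bytes per dimension) plus about 2*M 4-byte graph links."""
    return d * 4 + M * 2 * 4

def pq_bytes_per_vector(m):
    """Product quantization with 8-bit codes: one byte per sub-quantizer."""
    return m

n = 10_000_000  # assumed number of vectors
d = 128         # assumed dimensionality
M = 32          # assumed HNSW links-per-node parameter
m = 16          # assumed number of PQ sub-quantizers

hnsw_gb = n * hnsw_bytes_per_vector(d, M) / 1e9
pq_gb = n * pq_bytes_per_vector(m) / 1e9
print(f"HNSW: ~{hnsw_gb:.2f} GB, PQ: ~{pq_gb:.2f} GB")
```

The PQ figure ignores codebooks and any coarse-quantizer overhead, and the compression is lossy, so the smaller footprint is paid for in recall.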


How does it compare to the OpenDistro for Elastic KNN plugin - which also uses HNSW (and also includes scalable storage, high availability, backups, and filtering)?


Our API is built from the ground up with the machine learning workflow in mind. For example, we have a training API that allows you to batch requests and even download your embeddings and generate an HNSW index locally. Our view of versioning, rollbacks, and more makes a lot of sense for an ML index, but very little sense for a search index.


What makes this different from something like gensim? They have vector search for doc2vec embeddings.


Gensim is great for generating certain types of embeddings, but not for operationalizing them. It doesn’t do approximate nearest neighbor lookup, which is a deal breaker for most models that use embeddings at scale. It also does not manage versioning, so you end up having to hack a workflow around it to manage embeddings. Finally, it’s not really data infrastructure like this is, so you end up doing hacky things like copying all your embeddings into every Docker image. With regard to serving embeddings, gensim is just a library that supports in-memory brute-force nearest neighbour lookups.


gensim actually allows you to use both annoy and nmslib with gensim generated vectors as part of the api.

https://radimrehurek.com/gensim/similarities/nmslib.html

https://radimrehurek.com/gensim/similarities/annoy.html


This looks awesome - psyched to try! Embeddings are a bitch, nice to see some new tools for managing them :)


Thanks for the kind words! We'd love to get your feedback as we iterate. Please join our slack community: https://join.slack.com/t/featureform-community/shared_invite...


which search algorithm does it use?


We use HNSW internally via HNSWLIB, it's the same algorithm that Facebook uses to power their embedding search.


thanks! how did you make the decision to use hnsw over faiss and other search algorithms?


Faiss actually also uses HNSW internally, HNSWLIB is just a lighter weight implementation which allowed us to iterate faster. In the future we will switch it back out for FAISS to take advantage of its full array of functionality.


keep up with the good work!



