EuclidesDB: a multi-model machine learning feature database (readthedocs.io)
159 points by drewvolpe 6 months ago | 42 comments

This is awesome! My company, Reflect, was looking at building a set of features similar to this but on top of existing data in peoples' systems right before we were acquired.

There's a big market of people who want to do simple data science and machine learning but don't know how to get started, don't have the expertise to implement the algorithms, and find the required ETL really daunting. You could put all of this on rails by integrating with existing systems.

Pardon my ignorance but could you explain further how this might simplify things? It doesn’t seem like a ML-specific DB would make things easier on folks who don’t know ML in the first place.

If you don't know ML and you want to do similarity search, you can just use the models that come together with EuclidesDB and just make calls to add items and query for similarity with less than 10 lines of Python code. So it will really simplify things for people who don't want to retrain any model or implement a backend, indexing, search, etc.

Hopefully I'm thinking this the right way:

Say I'm searching for a particular breed of dog, and EuclidesDB already has a trained model. Do I just need to query with my image of the dog breed?

How does it differ from me interacting directly with a model already saved after training on Tensorflow?

Very cool. I've been doing some research on DBs and how deep learning researchers connect them to their data pipelines, but this is the first open source one I've seen explicitly designed around that purpose. One thing I noticed is that LMDB is quite widely used, at least according to my research... At least one paper I read said the following: "LMDB database files . . . are predominant in the deep learning community" [1]. What makes EuclidesDB different than LMDB?

[1] https://www.mcs.anl.gov/papers/P7075-0717.pdf

LMDB is a key-value database. EuclidesDB is built on top of an existing key-value database, LevelDB (it could have been LMDB), as its persistence layer.

I'm not familiar with EuclidesDB, but from what I understand of the documented architecture, it uses gRPC to expose an RPC interface over the internet and abstracts away the objects in the machine learning domain (like images), tied to PyTorch.

Anyway, it's a higher-level concept and architecture modeled on top of a key-value store like LMDB, which is a lower-level building block for this product. LMDB is comparable to LevelDB, not to EuclidesDB.

(It looks like the author is here, so please correct me if I'm mistaken in any way.)

Hi, you're correct: LevelDB is the lower-level building block for EuclidesDB; it's the underlying storage for each item's features/predictions/metadata. EuclidesDB uses gRPC as its communication protocol (the reasons for that design decision are described in the docs) and protobuf as the serialization mechanism for its RPC communication with client APIs (e.g. the Python client API). EuclidesDB is also tightly coupled with libtorch (the PyTorch C++ backend), and it is EuclidesDB that is responsible for running inference (the forward pass) on the models instead of the client, so it takes the heavy burden off clients and moves it into the database engine itself. EuclidesDB also has querying capabilities (using LSH for performance) to query what was added to it, so again, the query (and the feature extraction for it) is executed by EuclidesDB. So the comparison between LevelDB/LMDB and EuclidesDB doesn't make much sense: they are low-level embeddable engines for key-value storage (which EuclidesDB uses).
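
For anyone wondering what "using LSH for performance" can look like, here is a minimal sketch of random-hyperplane LSH in pure Python. It is a toy illustration of the general idea, not EuclidesDB's implementation; the names `make_hyperplanes`, `lsh_signature` and `LSHIndex` are made up for this example.

```python
import random


def make_hyperplanes(n_planes, dim, seed=0):
    """Draw n_planes random hyperplane normals with Gaussian components."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_planes)]


def lsh_signature(vector, hyperplanes):
    """One bit per hyperplane: which side of the plane the vector falls on."""
    return tuple(
        1 if sum(v * h for v, h in zip(vector, plane)) >= 0 else 0
        for plane in hyperplanes
    )


class LSHIndex:
    """Hypothetical toy index: vectors with the same signature share a bucket."""

    def __init__(self, dim, n_planes=8, seed=0):
        self.hyperplanes = make_hyperplanes(n_planes, dim, seed)
        self.buckets = {}

    def add(self, item_id, vector):
        key = lsh_signature(vector, self.hyperplanes)
        self.buckets.setdefault(key, []).append((item_id, vector))

    def candidates(self, vector):
        """Return ids in the query's bucket; only these need exact comparison."""
        key = lsh_signature(vector, self.hyperplanes)
        return [item_id for item_id, _ in self.buckets.get(key, [])]


index = LSHIndex(dim=4, n_planes=8, seed=42)
features = [0.3, -1.2, 0.8, 2.0]  # stand-in for a CNN feature vector
index.add("a", features)
```

The point of the design is that a query only has to be compared against the items in its own bucket instead of every stored vector. Vectors pointing in similar directions tend to land in the same bucket, at the cost of occasionally missing a true neighbor.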

Hi, can you share what your thoughts were when wanting to build something like this?

Looks interesting. Are you the creator? If so..

Any reason why you chose LevelDB?

I was actually looking at clipper.ai for model serving at my work. I know it's not a 1-to-1 comparison, as Clipper is much more generic whereas this seems to tie in closely with PyTorch. Can it support models created using other libraries?

Hi, thanks for the feedback, I'm the creator of the framework. Answering your question: I chose LevelDB because it offers good transactional capability for the key-value store, and it also provides a very easy-to-use, very stable embedded library with decent performance and a binary store that helps me keep protobuf-serialized content. In summary, it's due to a mix of stability, database locking, binary storage and transactional support. I use it to store item metadata (I'm still working on exposing that through the APIs) and model features/predictions.

It was created by Christian Perone, a researcher at Polytechnique Montreal. I don't think he's active on HN. I saw he released it and just started looking through it.

Thanks for posting it here, drewvolpe!

It seems this solves a different problem from clipper.ai, which seems to store model parameters themselves to answer f(inbound object, model); this also stores data, so it can answer f(inbound object, thousands of existing objects, model) where f might be nearest-neighbor search.

Not OP, but it looks like right now it only supports serving JIT compiled PyTorch models.

https://clarifai.com/developer/guide/search#search does something very similar as a service; it allows you to ingest numerous images, then feed them through constantly-evolving models and have any number of model-based indices over those images that can answer similarity queries based on previously-unseen inputs. Great to see that there's open-source competition, and that they're focusing on developer productivity (via the tight coupling with Torch) rather than prematurely adding layers of abstraction.

Doesn't using pretrained CNN models bias the embedding towards features that were relevant in the dataset the model learned from (e.g. ImageNet)? Fine-grained classification uses triplet and siamese networks to learn an embedding based on semantic similarity, but you have to define what is semantically similar in the dataset. I am curious how well pretrained networks generalize for image search indexing. I think finding papers on how Pinterest and eBay apply visual search at scale may shed some light on this.

The successive convolutional layers in popular CNN architectures learn representations that go from very general (edges) near the input to very specific (dog breeds) at the last conv layer. How different your new domain is will dictate which layer you take the CNN representation from (early layers or later layers) and whether a generic ImageNet model will work at all.

Of course if your new domain is very different than the distribution of the original training data, it is a good technique to fine-tune the network a bit with your data.

Just an extra note: that is the key reason why EuclidesDB supports multiple models, so you can have, for instance, a ResNet trained on ImageNet for some images and another ResNet (same architecture) fine-tuned on your data (domain adapted) for a different semantic space. A concrete example is a fashion company that has fine-tuned different models for different product categories:

Model A = fine-tuned to classify between different types of shoes;

Model B = fine-tuned to classify between different t-shirt types;

EuclidesDB can have these two models and you can add/query items into each one of these different models (hence the concept of "model/module space" that is used by EuclidesDB).
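
To make the "model space" idea concrete, here is a minimal sketch in pure Python. It is not the EuclidesDB client API: the `ModelSpaces` class, the model names, and the lambda "models" are hypothetical stand-ins for two fine-tuned networks, and the search is brute force rather than an approximate index.

```python
import math


def euclidean(a, b):
    """Plain Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class ModelSpaces:
    """Toy store keeping one separate feature index per model."""

    def __init__(self, models):
        # models: name -> callable turning a raw item into a feature vector
        self.models = models
        self.spaces = {name: {} for name in models}

    def add(self, item_id, raw_item, model_names):
        """Run each requested model and store its features in that model's space."""
        for name in model_names:
            self.spaces[name][item_id] = self.models[name](raw_item)

    def query(self, raw_item, model_name, top_k=1):
        """Extract features with one model and rank only that model's items."""
        q = self.models[model_name](raw_item)
        ranked = sorted(
            self.spaces[model_name].items(),
            key=lambda kv: euclidean(q, kv[1]),
        )
        return [item_id for item_id, _ in ranked[:top_k]]


# Hypothetical stand-ins for "Model A" (shoes) and "Model B" (t-shirts).
models = {
    "shoes": lambda item: [item["heel"], item["laces"]],
    "tshirts": lambda item: [item["sleeve"], item["collar"]],
}

db = ModelSpaces(models)
db.add("boot", {"heel": 3.0, "laces": 1.0}, ["shoes"])
db.add("sneaker", {"heel": 1.0, "laces": 1.0}, ["shoes"])
db.add("polo", {"sleeve": 0.5, "collar": 1.0}, ["tshirts"])
```

A query against the "shoes" space never sees the t-shirt items, which is the property the comment above describes: each model defines its own semantic space.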

Yes, this is essentially the Clarifai product: use a CNN (without the softmax) as a feature extractor for images and then do ANN search on them in the embedding space.

I gave a talk on the theory behind image-to-image search, if anyone is interested. Image search is essentially what this backend is well suited for and what the graphic on their home page uses:


Do you know if there are common tools people use as a search engine backend for this? Vespa.ai seems promising, although I can't tell if there are many people using it.

the pieces are:

* CNN (plenty of pre-trained ones)
* Approximate Nearest Neighbors database (Annoy, etc.)
* webserver to host the CNN and serve the UI

It's popular to serve TensorFlow models with TensorFlow Serving + Kubernetes, and that's what I've done in the past.

If you've ever tried to deploy a deep learning based image-to-image search product, you will know the engineering challenges, especially with the Approximate Nearest Neighbors infrastructure. This is good progress in abstracting out that step!

Thanks for the feedback !

Has anyone done this serialization with a relational DB like Postgres, which has hugely performant key-value storage in the form of hstore or JSONB?

This coupling to PyTorch is very cool, but basing this on a production-capable database like Postgres (which has incredible hosted solutions like Google Cloud SQL, AWS RDS, Azure, etc.) would be much more useful.

We’re exploring it at the moment. However, seems like there are many ways one could “save” a model. So far we’ve focused more on handling serialised models directly as opposed to storing key-values.

What does that mean? Binary format? Well, the BLOB datatype in PostgreSQL is also production ready.

Yes, from uploading raw code to accepting binary formats.

Maybe something like http://madlib.apache.org/ ?

Interesting DB for feature storage, and LSH is a good choice, I believe. I'm wondering why the tight link to PyTorch C++ tensors (under refactoring, actually), but I haven't looked at the EuclidesDB code yet. Thanks for sharing!

Those interested can also find an open source integration of lmdb + annoy here: https://github.com/jolibrain/deepdetect/blob/master/src/sims...

This is the underlying support for similarity search based on embeddings, including image and object similarity search, see https://github.com/jolibrain/deepdetect/tree/master/demo/obj...

This is running for apps such as a Shazam for art, faster annotation tooling and text similarity search.

Annoy only supports indexing once, while hnswlib supports incremental indexing, something I'm looking at.

https://github.com/nmslib/hnswlib for anybody else googling this library

We'll be integrating other indexing methods in the near future (such as faiss); Annoy is just the one indexing option that has been implemented so far. Each indexing method has its pros and cons, so you'll be able to select the search engine backend according to your restrictions.

There are many reasons why we depart from other libraries. Many of them, for instance, use JSON+base64 (HTTP/1) for serialization, while we use protobuf+gRPC (HTTP/2).
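
As a rough illustration of the serialization point, the sketch below compares the size of a feature vector packed as raw float32 bytes with the same bytes wrapped in base64 inside a JSON document. The raw bytes are only a stand-in for a binary wire format (this is not protobuf), and the `pack_floats` / `json_base64_payload` helpers and the 512-dimensional vector are assumptions made up for the example.

```python
import base64
import json
import struct


def pack_floats(values):
    """Pack a list of floats as little-endian float32 bytes."""
    return struct.pack(f"<{len(values)}f", *values)


def json_base64_payload(values):
    """Wrap the same bytes in base64 inside a JSON document."""
    encoded = base64.b64encode(pack_floats(values)).decode("ascii")
    return json.dumps({"features": encoded}).encode("utf-8")


features = [0.5] * 512  # stand-in for a 512-dimensional embedding
binary = pack_floats(features)
wrapped = json_base64_payload(features)

# base64 emits 4 output characters for every 3 input bytes,
# so the JSON payload is roughly a third larger than the raw bytes.
print(len(binary), len(wrapped))
```

Size is only part of the story: the client and server also have to spend time encoding and decoding the base64 text, which a binary format avoids.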

Fantastic project! Vespa.ai is also an alternative, more focused on NLP.

Noob question: Which data serialisation format is used to represent models? Are there any standardisation efforts being undertaken by the community?

Not such a noob question!

R and Python (pickle, for instance) both have facilities for serializing objects that can (sometimes) be used for ML models. Additionally, tensorflow has an API for saving models and I've seen people save the coefficients and structure of neural nets in HDF5 files (https://danieltakeshi.github.io/2017/07/06/saving-neural-net...).
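
As a small illustration of the pickle route, here is a sketch that saves the coefficients and structure of a toy linear model as a plain dictionary and restores them. The `params` layout and the `predict` helper are made up for this example and are not any framework's format.

```python
import pickle


def predict(params, x):
    """Evaluate the toy linear model y = w . x + b from its stored parameters."""
    return sum(w * xi for w, xi in zip(params["weights"], x)) + params["bias"]


# Hypothetical "model": just its structure and coefficients.
params = {"kind": "linear", "weights": [2.0, -1.0], "bias": 0.5}

blob = pickle.dumps(params)    # bytes you could write to a file or a key-value store
restored = pickle.loads(blob)  # only unpickle data you trust

print(predict(restored, [1.0, 1.0]))
```

The catch is the one already mentioned: pickle is Python-specific, and unpickling untrusted data can execute arbitrary code, so it works for moving a model between your own Python processes but not as a cross-platform standard.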

But I wouldn't say there is a cross platform, universally adopted, model agnostic standard at this point (and it would be very difficult to create such a thing).

Disclaimer: I sell a product in this space.

Our model server just works with every format. We created our own DL framework and maintain it. For example, we're the 2nd largest contributor to Keras, alongside having our own DL framework.

We've found it's easier to just have 1 runtime with PMML, TF file import, Keras file import and ONNX interop, coupled with endpoints for reading and writing NumPy, Apache Arrow's tensor format, and our own library's binary format. We also have converters from R and Python pickle to PMML. Happy to elaborate if interested.

For ingest we typically allow storage of transformations, accepting raw images, arrays, etc. We have our own HDF5 bindings as well, but mainly for reading Keras files. Could you tell me a bit about what you might want for ETL from HDF5?

My work right now is more at the research/prototyping end (and often not deep learning), unfortunately. I will say though that your product is exactly the sort of thing I was referring to, and deeplearning4j is a great resource. It's cool to run into the author!

Would you be willing to characterise the sort of system that you end up replacing, what your customers were doing before using your service?

Sure! So our model server usually replaces a bunch of docker containers with different standards with zero ways of inputting data.

An interesting example was a customer that bought a startup. The startup had a docker container that leaked memory (they were using an advanced TF library, google quit maintaining it). The parent company had tomcat infrastructure. They told the startup to get it to work in tomcat. They approached us.

Dl4j itself is built on top of a more fundamental building block that allows us to do some cool stuff in c: https://github.com/bytedeco/javacpp-presets

We maintain the above as well. We have our own TensorFlow, Caffe, OpenCV, ONNX, etc. bindings. This gives you the same kind of pointer/zero-copy interop you would get in Python, but in Java.

Another area we play in is RPA. You have windows consultants who don't know machine learning but deal with business process data all day. They want standardized APIs and to run on prem on windows: https://www.uipath.com/webinars/rpa-innovation-week/custom-b...

They want the ability to automatically maintain models after they are implemented, with an integration that auto-corrects the model. We do that using our platform's APIs. Online retraining is another thing we do.

We also roll back automatically based on feedback if you specify a test set. If we find that the model becomes less accurate on the test set for a specified metric over time, we roll it back. This guards against a problem in machine learning called concept drift, where the domain changes over time.
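
A metric-based rollback like the one described above could be sketched as follows. This is only an illustration under stated assumptions, not the commenter's product: `should_roll_back`, `ModelRegistry`, the tolerance, and the window size are all hypothetical.

```python
def should_roll_back(metric_history, baseline, tolerance=0.02, window=3):
    """True when the last `window` scores all fall below baseline - tolerance."""
    if len(metric_history) < window:
        return False
    recent = metric_history[-window:]
    return all(score < baseline - tolerance for score in recent)


class ModelRegistry:
    """Hypothetical registry keeping deployed versions in order."""

    def __init__(self, initial_version):
        self.versions = [initial_version]

    @property
    def active(self):
        return self.versions[-1]

    def deploy(self, version):
        self.versions.append(version)

    def roll_back(self):
        """Drop the newest version unless it is the only one left."""
        if len(self.versions) > 1:
            self.versions.pop()
        return self.active


registry = ModelRegistry("v1")
registry.deploy("v2")

# Accuracy of v2 on the fixed test set, measured after each feedback batch.
history = [0.91, 0.85, 0.84, 0.83]
if should_roll_back(history, baseline=0.90):
    registry.roll_back()
```

Requiring several consecutive bad scores rather than one is a deliberate choice here, so that a single noisy evaluation does not trigger a rollback.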

Lastly, you have customers generating their own models, and you need to track/audit them all, sometimes in order to do online retraining as in the RPA use case. Some customers scale to millions of models. They need this kind of tracking.

It’s making life for us quite tricky! Any advice on which way is the most widely adopted method?

If you find out let me know! Issues surrounding how these models are used in production (deployment, retraining, monitoring, versioning, etc, etc) are, I believe, underexplored.

It's tempting to think that everyone else has this figured out but I don't think so.

There is a lot going on in this field right now; take a look at the ONNX ecosystem (https://onnx.ai/).
