
EuclidesDB: a multi-model machine learning feature database - drewvolpe
https://euclidesdb.readthedocs.io/en/latest/
======
bradhe
This is awesome! My company, Reflect, was looking at building a set of
features similar to this, but on top of existing data in people's systems,
right before we were acquired.

There's a big market of people who are looking to do simple data science and
machine learning but don't know how to get started, don't have a lot of
expertise to implement the algorithms, and find the required ETL really
daunting. You could put all of this on rails by integrating with existing
systems.

~~~
tixocloud
Pardon my ignorance, but could you explain further how this might simplify
things? It doesn't seem like an ML-specific DB would make things easier on
folks who don't know ML in the first place.

~~~
perone
If you don't know ML and you want to do similarity search, you can just use
the models that ship with EuclidesDB and make calls to add items and query
for similarity in fewer than 10 lines of Python code. So it will really
simplify things for people who don't want to retrain any model or implement a
backend, indexing, search, etc.
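
To give a sense of what that looks like, here is a hypothetical sketch; the
package and method names below are illustrative assumptions, not the actual
client API (the real calls are in the docs):

```python
# Hypothetical sketch of a EuclidesDB-style workflow. The package name,
# channel setup, and method names are assumptions for illustration;
# consult the EuclidesDB docs for the real Python client API.
import euclides  # assumed client package name

with euclides.Channel("localhost", 50000) as channel:
    db = euclides.EuclidesDB(channel)
    # The forward pass happens server-side: the DB extracts features itself.
    db.add_image(item_id=1, models=["resnet18"], image=some_pil_image)
    # Query by example image against the same model space.
    results = db.find_similar(image=query_pil_image,
                              models=["resnet18"], top_k=10)
```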

~~~
tixocloud
Hopefully I'm thinking about this the right way:

If I'm searching for a particular breed of dog, and let's say EuclidesDB's
models have already been trained, do I just need to query based on my image
of the dog breed?

How does it differ from me interacting directly with a model already saved
after training in TensorFlow?

------
enisberk
Reddit post by the author:
[https://www.reddit.com/r/MachineLearning/comments/9yhsbu/p_e...](https://www.reddit.com/r/MachineLearning/comments/9yhsbu/p_euclidesdb_a_multimodel_machine_learning/?ref=readnext)

------
Rafuino
Very cool. I've been doing some research on DBs and how deep learning
researchers connect them to their data pipelines, but this is the first open-
source one I've seen explicitly designed around that purpose. One thing I
noticed is that LMDB is quite widely used, at least according to my research.
At least one paper I read said the following: "LMDB database files
. . . are predominant in the deep learning community" [1]. What makes
EuclidesDB different from LMDB?

[1]
[https://www.mcs.anl.gov/papers/P7075-0717.pdf](https://www.mcs.anl.gov/papers/P7075-0717.pdf)

~~~
oscargrouch
LMDB is a key-value database. EuclidesDB is built on top of an existing key-
value database, LevelDB (it could just as well have been LMDB), as its
persistence layer.

I'm not familiar with EuclidesDB, but from what I've understood of the
documented architecture, it uses gRPC to expose an RPC interface over the
network, and abstracts away the objects in the machine learning domain (like
images), tied to PyTorch.

Anyway, it's a higher-level concept and architecture modeled on top of a key-
value store like LMDB, which is a lower-level building block for this kind of
product. LMDB is comparable to LevelDB, not to EuclidesDB.

(It looks like the author is here, so please correct me if I'm mistaken in
any way.)

~~~
perone
Hi, you're correct: LevelDB is the lower-level building block for EuclidesDB;
it's the underlying storage for items' features/predictions/metadata.

EuclidesDB uses gRPC (the reasons for that design decision are described in
the docs) as its communication protocol and protobuf as the serialization
mechanism for its RPC communication with client APIs (i.e. the Python client
API). EuclidesDB is also tightly coupled with libtorch (the PyTorch C++
backend), and it is EuclidesDB that is responsible for running inference (the
forward pass) on the models instead of the client, so it takes all the heavy
burden off the clients and moves it into the database engine itself.

EuclidesDB also has querying capabilities (using LSH for performance) to
query what was added into it, so again, the query (and the feature extraction
for it) is executed by EuclidesDB. So the comparison between LevelDB/LMDB and
EuclidesDB doesn't make much sense; they are low-level embeddable engines for
key-value storage (which EuclidesDB uses).
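
For intuition on the LSH part: random-hyperplane LSH hashes vectors so that
directionally similar vectors tend to collide in the same bucket, letting a
similarity query scan one bucket instead of the whole collection. A minimal
toy sketch of the idea (not EuclidesDB's actual implementation):

```python
import numpy as np

class RandomHyperplaneLSH:
    """Toy random-hyperplane LSH: vectors with high cosine similarity
    tend to land in the same bucket, so a query only scans one bucket."""

    def __init__(self, dim, n_bits=16, seed=0):
        rng = np.random.default_rng(seed)
        self.planes = rng.normal(size=(n_bits, dim))  # random hyperplanes
        self.buckets = {}

    def _hash(self, v):
        # One bit per hyperplane: which side of the plane is v on?
        return tuple(int(b) for b in (self.planes @ v > 0))

    def add(self, item_id, v):
        self.buckets.setdefault(self._hash(v), []).append(item_id)

    def query(self, v):
        # Candidate ids sharing the query's bucket; rank these exactly.
        return self.buckets.get(self._hash(v), [])
```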

~~~
tixocloud
Hi, can you share what your thinking was when you set out to build something
like this?

------
alienreborn
Looks interesting. Are you the creator? If so...

Any reason why you chose LevelDB?

I was actually looking at clipper.ai for model serving at my work. I know
it's not a 1-to-1 comparison, as Clipper is much more generic whereas this
seems to tie in closely with PyTorch. Can it support models created using
other libraries?

~~~
drewvolpe
It was created by Christian Perone, a researcher at Polytechnique Montreal. I
don't think he's active on HN. I saw he released it and just started looking
through it.

~~~
perone
Thanks for posting it here, drewvolpe!

------
btown
[https://clarifai.com/developer/guide/search#search](https://clarifai.com/developer/guide/search#search)
does something very similar as a service; it allows you to ingest numerous
images, then feed them through constantly evolving models, and have any
number of model-based indices over those images that can answer similarity
queries based on previously unseen inputs. Great to see that there's open-
source competition, and that they're focusing on developer productivity (via
the tight coupling with Torch) rather than prematurely adding layers of
abstraction.

~~~
mendeza
Doesn't using pretrained CNN models make the embedding biased towards
extracting features that were relevant to the dataset it learned from (e.g.
ImageNet)? Fine-grained classification uses triplet and Siamese networks to
learn an embedding based on semantic similarity, but you have to define what
is semantically similar in the dataset. I am curious how well pretrained
networks generalize when used for image-search indexing. I think the papers
on how Pinterest and eBay apply visual search at scale may shed some light on
this.

~~~
eggie5
The successive convolutional layers in popular CNN architectures learn
representations ranging from very general (edges) to very specific (dog
breeds), from the input to the last conv layer respectively. How different
your new domain is will dictate which layer you take the CNN representation
from (early layers or later layers) and whether a generic ImageNet model will
work at all.

Of course, if your new domain is very different from the distribution of the
original training data, it is a good technique to fine-tune the network a bit
with your data.
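
As a concrete illustration, here is a minimal PyTorch sketch of taking the
representation from the layer just before the classifier of a pretrained
ResNet; for a more distant domain you might hook an earlier, more generic
layer instead:

```python
import torch
import torchvision.models as models

model = models.resnet18(pretrained=True)
model.eval()

# Drop the final fc layer to get the 512-d global-average-pooled
# representation (the most "specific" pre-classifier features).
feature_extractor = torch.nn.Sequential(*list(model.children())[:-1])

with torch.no_grad():
    x = torch.randn(1, 3, 224, 224)  # stand-in for a preprocessed image
    embedding = feature_extractor(x).flatten(1)  # shape: (1, 512)

# For an earlier (more generic) representation, grab an intermediate
# layer's output with a forward hook instead:
acts = {}
model.layer2.register_forward_hook(lambda m, i, o: acts.update(layer2=o))
with torch.no_grad():
    model(x)
```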

~~~
perone
Just an extra note: that is the key reason why EuclidesDB supports multiple
models, so you can have, for instance, a ResNet trained on ImageNet for some
images and another ResNet (same architecture) fine-tuned on your data
(domain-adapted) for a different semantic space. A concrete example: think of
a fashion company that has fine-tuned different models for different product
categories:

Model A = fine-tuned to classify between different types of shoes;

Model B = fine-tuned to classify between different t-shirt types;

EuclidesDB can have these two models and you can add/query items into each one
of these different models (hence the concept of "model/module space" that is
used by EuclidesDB).
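
Continuing the hypothetical client sketch from earlier in the thread (again,
the method names are illustrative assumptions, not the actual API), adding
and querying against each model space might look like:

```python
# Hypothetical: each call targets one "model space"; names are assumptions.
db.add_image(item_id=7, models=["resnet_shoes"], image=shoe_image)
db.add_image(item_id=8, models=["resnet_tshirts"], image=tshirt_image)

# Similarity is computed within the chosen model's semantic space.
shoe_hits = db.find_similar(image=query_shoe,
                            models=["resnet_shoes"], top_k=5)
```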

------
eggie5
I gave a talk on the theory behind image-to-image search if anyone is
interested. Image search is essentially what this backend is well suited for
and what the graphic on their home page shows:

[http://www.eggie5.com/126-semantic-image-search-video](http://www.eggie5.com/126-semantic-image-search-video)

~~~
garysieling
Do you know if there are common tools people use as a search engine backend
for this? Vespa.ai seems promising, although I can't tell if there are many
people using it.

~~~
eggie5
The pieces are:

* CNN (plenty of pre-trained ones available)
* Approximate Nearest Neighbors database (Annoy, etc.)
* webserver to host the CNN and serve the UI

It's popular to serve TensorFlow models with TensorFlow Serving + Kubernetes,
and that's what I've done in the past.
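
For the ANN piece, Annoy's Python API is small; a minimal sketch of building
and querying an index over, say, 512-d CNN embeddings (the `embeddings` and
`query_embedding` variables are placeholders):

```python
from annoy import AnnoyIndex

dim = 512  # e.g., ResNet-18 embedding size
index = AnnoyIndex(dim, "angular")  # angular distance ~ cosine similarity

for item_id, vector in enumerate(embeddings):  # embeddings: list of vectors
    index.add_item(item_id, vector)

index.build(10)  # 10 trees; the index is immutable after this call
index.save("items.ann")

neighbor_ids = index.get_nns_by_vector(query_embedding, 10)
```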

------
eggie5
If you've ever tried to deploy a deep-learning-based image-to-image search
product, you will know the engineering challenges, especially with the
Approximate Nearest Neighbors infrastructure. This is good progress in
abstracting away that step!

~~~
perone
Thanks for the feedback!

------
sandGorgon
Has anyone done this kind of serialization with a relational DB like
Postgres, which has a hugely performant key-value store in Hstore or JSONB?

The coupling to PyTorch is very cool, but basing this on a production-capable
database like Postgres (which has excellent hosted offerings like Google
Cloud SQL, AWS RDS, Azure, etc.) would be much more useful.
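
For reference, a rough sketch of what this might look like, with made-up
table and column names; note that JSONB gives you durable key-value storage
for feature vectors, but nearest-neighbor search would still have to happen
in the application layer:

```python
# Rough sketch: persisting feature vectors in Postgres JSONB via psycopg2.
# The table/column names and connection string are made up for illustration.
import json
import psycopg2

conn = psycopg2.connect("dbname=features")
cur = conn.cursor()
cur.execute("""
    CREATE TABLE IF NOT EXISTS item_features (
        item_id  integer PRIMARY KEY,
        features jsonb   NOT NULL
    )
""")
cur.execute(
    "INSERT INTO item_features (item_id, features) VALUES (%s, %s)",
    (42, json.dumps({"model": "resnet18", "vector": [0.12, 0.7, 0.03]})),
)
conn.commit()
```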

~~~
tixocloud
We're exploring it at the moment. However, it seems like there are many ways
one could "save" a model. So far we've focused more on handling serialised
models directly as opposed to storing key-values.

~~~
sandGorgon
What does that mean? A binary format? Well, the BLOB datatype in PostgreSQL
is also production-ready.

~~~
tixocloud
Yes, from uploading raw code to accepting binary formats.

------
pilooch
Interesting DB for feature storage, and LSH is a good choice, I believe. I'm
wondering about the tight link to PyTorch C++ tensors (under refactoring,
actually), but I haven't looked at the EuclidesDB code yet. Thanks for
sharing!

Those interested can also find an open source integration of lmdb + annoy
here:
[https://github.com/jolibrain/deepdetect/blob/master/src/sims...](https://github.com/jolibrain/deepdetect/blob/master/src/simsearch.cc#L188)

This is the underlying support for similarity search based on embeddings,
including image and object similarity search; see
[https://github.com/jolibrain/deepdetect/tree/master/demo/obj...](https://github.com/jolibrain/deepdetect/tree/master/demo/objsearch)

This is running for apps such as a Shazam for art, faster annotation tooling
and text similarity search.

Annoy only supports indexing once, while hnswlib supports incremental
indexing, which is something I'm looking at.
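
To make the distinction concrete: once `build()` is called, an Annoy index is
read-only, while hnswlib lets you keep inserting. A minimal hnswlib sketch
with stand-in data:

```python
import hnswlib
import numpy as np

dim = 512
index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=100_000, ef_construction=200, M=16)

first_batch = np.random.rand(1_000, dim).astype(np.float32)  # stand-in data
index.add_items(first_batch, np.arange(1_000))

# Incremental insert later; no rebuild needed, unlike Annoy:
new_batch = np.random.rand(100, dim).astype(np.float32)
index.add_items(new_batch, np.arange(1_000, 1_100))

labels, distances = index.knn_query(first_batch[:1], k=10)
```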

~~~
dtjohnnyb
[https://github.com/nmslib/hnswlib](https://github.com/nmslib/hnswlib) for
anybody else googling this library

------
bratao
Fantastic project! Vespa.ai is also an alternative, more focused on NLP.

------
devj
Noob question: Which data serialisation format is used to represent models?
Are there any standardisation efforts being undertaken by the community?

~~~
jrumbut
Not such a noob question!

R and Python (pickle, for instance) both have facilities for serializing
objects that can (sometimes) be used for ML models. Additionally, TensorFlow
has an API for saving models, and I've seen people save the coefficients and
structure of neural nets in HDF5 files
([https://danieltakeshi.github.io/2017/07/06/saving-neural-network-model-weights-using-a-hierarchical-organization/](https://danieltakeshi.github.io/2017/07/06/saving-neural-network-model-weights-using-a-hierarchical-organization/)).

But I wouldn't say there is a cross-platform, universally adopted, model-
agnostic standard at this point (and it would be very difficult to create
such a thing).
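
To make those options concrete, a sketch of the two most common routes in
Python (the model here is a placeholder, not a real trained network):

```python
# Two common, framework-specific ways to persist a model; as noted above,
# there is no universal, model-agnostic standard.
import pickle
import torch
import torch.nn as nn

net = nn.Linear(512, 10)  # placeholder for a real trained network

# 1) Generic Python object serialization: convenient, but fragile across
#    library versions and not portable outside Python.
with open("model.pkl", "wb") as f:
    pickle.dump(net, f)

# 2) Framework-native: persist only the weights and rebuild the
#    architecture in code when loading.
torch.save(net.state_dict(), "weights.pt")
net.load_state_dict(torch.load("weights.pt"))
```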

~~~
agibsonccc
Disclaimer: I sell a product in this space.

Our model server just works with every format. We created our own DL
framework and maintain it. E.g., we're the 2nd-largest contributor to Keras
alongside having our own DL framework.

We've found it's easier to just have one runtime with PMML, TF file import,
Keras file import, and ONNX interop, coupled with endpoints for reading and
writing NumPy, Apache Arrow's tensor format, and our own library's binary
format. We also have converters from R and Python pickle to PMML as well.
Happy to elaborate if interested.

For ingest we typically allow storage of transformations, accepting raw
images, arrays, etc. We have our own HDF5 bindings as well, but mainly for
reading Keras files. Could you tell me a bit about what you might want for
ETL from HDF5?

~~~
jrumbut
My work right now is more at the research/prototyping end (and often not deep
learning), unfortunately. I will say, though, that your product is exactly
the sort of thing I was referring to, and Deeplearning4j is a great resource.
It's cool to run into the author!

Would you be willing to characterise the sort of system that you end up
replacing, i.e. what your customers were doing before using your service?

~~~
agibsonccc
Sure! So our model server usually replaces a bunch of Docker containers built
around different standards, with no consistent way of inputting data.

An interesting example was a customer that bought a startup. The startup had
a Docker container that leaked memory (they were using an advanced TF library
that Google quit maintaining). The parent company had Tomcat infrastructure.
They told the startup to get it to work in Tomcat. They approached us.
They told the startup to get it to work in tomcat. They approached us.

DL4J itself is built on top of a more fundamental building block that allows
us to do some cool stuff in C:
[https://github.com/bytedeco/javacpp-presets](https://github.com/bytedeco/javacpp-presets)

We maintain the above as well. We have our own TensorFlow, Caffe, OpenCV,
ONNX, etc. bindings. This gives you the same kind of pointer/zero-copy
interop you would get in Python, but in Java.

Another area we play in is RPA. You have Windows consultants who don't know
machine learning but deal with business-process data all day. They want
standardized APIs and to run on-prem on Windows:
[https://www.uipath.com/webinars/rpa-innovation-week/custom-built-machine-learning-uipath-rpa](https://www.uipath.com/webinars/rpa-innovation-week/custom-built-machine-learning-uipath-rpa)

They want the ability to automatically maintain models after they are
deployed, with an integration that auto-corrects the model. We do that using
our platform's APIs. Online retraining is another thing we do.

We also do automatic rollback based on feedback if you specify a test set. If
we find the test becomes less accurate on a specified metric over time, we
roll back the model. This mitigates a problem in machine learning called
concept drift, where the domain changes over time.

Lastly, you have customers generating their own models, and you need to
track/audit them all, sometimes to do online retraining, as in the RPA use
case. Some customers scale to millions of models. They need this kind of
tracking.

