
Building a Content-Based Search Engine II: Extracting Feature Vectors - deepideas
http://www.deepideas.net/building-content-based-search-engine-feature-extraction/
======
pilooch
You can set up a feature-based image search engine or image object search
engine with open source libraries, a few additional lines of Python, and a
couple of pre-trained models; see
[https://github.com/beniz/deepdetect/tree/master/demo/imgsearch](https://github.com/beniz/deepdetect/tree/master/demo/imgsearch)
and
[https://github.com/beniz/deepdetect/tree/master/demo/objsearch](https://github.com/beniz/deepdetect/tree/master/demo/objsearch)
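
If you want to see what the feature-extraction step looks like without going
through the DeepDetect server, here's a rough sketch using torchvision and a
pre-trained ResNet-50 instead (purely illustrative; the demos above do the
equivalent via DeepDetect's API):

```python
import torch
import torchvision.models as models
import torchvision.transforms as transforms
from PIL import Image

# Load a pre-trained ResNet-50 and drop the classifier head, keeping
# the pooled 2048-d feature vector from the layer before it.
model = models.resnet50(pretrained=True)
model = torch.nn.Sequential(*list(model.children())[:-1])
model.eval()

# Standard ImageNet preprocessing for the pre-trained weights.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def extract(path):
    """Return a feature vector for one image, ready to index."""
    img = preprocess(Image.open(path).convert('RGB')).unsqueeze(0)
    with torch.no_grad():
        return model(img).flatten().numpy()
```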

------
nl
This is a good summary.

I've built an interesting NN-based model, and I've been thinking about using
some of the early layers as features for a search engine.

The obvious thing to do is to dump the feature vectors into Elastic or Solr or
something.

Ideally what I want is to:

1) Put 1024-dimensional vectors of floats into the index

2) Use plain cosine distance as the distance metric

2b) Customize the distance metric (or, preferably, use a preexisting and
optimized earth mover's distance implementation).

My initial Googling indicated that 1 and 2 are harder than I expected: it
seems that neither Elastic nor Solr has a good representation for vectors, and
both assume you want BM25 or TF-IDF for your ranking.

Surely I'm missing something? I'm not super keen on having to drop back to
using Lucene.
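
To make 1) and 2) concrete, here's the brute-force baseline I'd be replacing:
plain numpy, exact cosine, no index structure at all. Fine up to maybe a few
hundred thousand vectors, after which you want an approximate index (data
below is random, just to show the shape of the problem):

```python
import numpy as np

# Toy index: rows are 1024-d feature vectors, L2-normalized once up
# front so cosine similarity reduces to a dot product.
index = np.random.randn(100000, 1024).astype(np.float32)
index /= np.linalg.norm(index, axis=1, keepdims=True)

def search(query, k=10):
    """Return the row ids of the k nearest vectors by cosine similarity."""
    q = query / np.linalg.norm(query)
    sims = index @ q              # cosine similarity against every vector
    return np.argsort(-sims)[:k]

hits = search(np.random.randn(1024).astype(np.float32))
```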

~~~
polm23
I don't think you're missing anything. Elastic (I'm not very familiar with
Solr) is built on assumptions from TF-IDF etc., and a big one is that you
reduce the size of your potential result set by looking up words in an index.
Obviously that doesn't hold if your documents are represented as word vectors.

You can use something like annoy for searching vector space. It's not made
specifically for text search, so you'll have to roll your own normalization
and so on, but that's usually the less complicated part anyway.

[https://github.com/spotify/annoy](https://github.com/spotify/annoy)
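
Wiring your 1024-d vectors into it is only a few lines; 'angular' is Annoy's
cosine-style metric (the data and parameters below are just placeholders):

```python
from annoy import AnnoyIndex
import numpy as np

dim = 1024
# 'angular' distance is sqrt(2 * (1 - cos)), so rankings match cosine.
index = AnnoyIndex(dim, 'angular')

# Add each feature vector under an integer id; you keep the
# id -> document mapping yourself.
for i, vec in enumerate(np.random.randn(1000, dim)):
    index.add_item(i, vec)

index.build(10)            # 10 trees; more trees = better recall, bigger index
index.save('vectors.ann')  # the index is mmap-able from disk

# At query time: approximate nearest neighbors for an arbitrary vector.
ids, dists = index.get_nns_by_vector(np.random.randn(dim), 10,
                                     include_distances=True)
```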

~~~
nl
Yeah, Annoy and FAISS (which the other reply pointed to) are both interesting.

I read [https://erikbern.com/2018/02/15/new-benchmarks-for-approximate-nearest-neighbors.html](https://erikbern.com/2018/02/15/new-benchmarks-for-approximate-nearest-neighbors.html)
(by the author of Annoy) with great interest.

He points to HNSW, which I knew nothing about, but it seems very fast!
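
For reference, here's roughly what one of the HNSW implementations from those
benchmarks (the hnswlib package) looks like in use; the parameters are just
illustrative defaults, not tuned values:

```python
import hnswlib
import numpy as np

dim = 1024
data = np.random.randn(10000, dim).astype(np.float32)

# 'cosine' space reports 1 - cosine_similarity as the distance.
index = hnswlib.Index(space='cosine', dim=dim)
index.init_index(max_elements=data.shape[0],
                 ef_construction=200,  # build-time accuracy/speed trade-off
                 M=16)                 # graph connectivity per node
index.add_items(data, np.arange(data.shape[0]))

index.set_ef(50)  # query-time accuracy/speed trade-off (keep ef >= k)
labels, distances = index.knn_query(data[:1], k=10)
```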

