
Ask HN: Vector-space database (as a service)? - kleebeesh
Methods like collaborative filtering via matrix factorization, Word2vec, Doc2vec, etc. map large, sparse matrices into a low-dimensional vector space while enforcing similarity constraints. There are extensions for vectorizing various modalities (users, items, documents, audio, images, etc.) into one vector space for similarity search and recommendation ([1], [2], [3]), and there is extensive research on approximate nearest-neighbor search ([4]).

For example: it's possible to map users, songs, and artists into a common vector space ([1]). Two users who listen to similar songs have high similarity, and songs are recommended based on their vector similarity to users. This pattern extends to many domains, as long as there is some similarity signal (likes, co-occurrences, etc.) to "train" the vectors.

In my experience, training the vectors is simpler than the engineering needed to query them efficiently (e.g. "select the 10 nearest neighbors to the vector with ID 123"). This becomes expensive for large datasets, and correctly using the approximate nearest-neighbor libraries is non-trivial.

I can't find any database that lets me insert vectors as they're computed and then run queries against them. Companies often seem to build a custom API on top of one of the approximate nearest-neighbor libraries, even though the interesting queries are pretty homogeneous.

Any ideas as to why none of the big DB players have an offering for this use case? Something like Algolia, but for vectors instead of text? Any recommendations for such a product?

[1] iHeartRadio queries various modalities of data from the same vector space: https://youtu.be/jjO1gOH-BW4?t=5m39s
[2] Using a convnet to map new (cold-start) songs into an existing vector space: http://benanne.github.io/2014/08/05/spotify-cnns.html
[3] Flickr similarity search: http://code.flickr.net/2017/03/07/introducing-similarity-search-at-flickr/
[4] Benchmarks for approximate nearest-neighbor libraries: https://github.com/erikbern/ann-benchmarks
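The query pattern described above ("select the 10 nearest neighbors to the vector with ID 123") can be sketched as a brute-force scan with NumPy. The random data and the `nearest_neighbors` helper are illustrative assumptions, not taken from any of the systems cited:

```python
import numpy as np

# Toy "vector store": one row per item, keyed by integer ID.
rng = np.random.default_rng(42)
vectors = rng.normal(size=(10_000, 64)).astype(np.float32)

def nearest_neighbors(query_id, k=10):
    """Return the IDs of the k most cosine-similar vectors, excluding the query itself."""
    q = vectors[query_id]
    norms = np.linalg.norm(vectors, axis=1) * np.linalg.norm(q)
    sims = vectors @ q / norms          # cosine similarity against every stored vector
    order = np.argsort(-sims)           # most similar first
    return [i for i in order if i != query_id][:k]

print(nearest_neighbors(123, k=10))
```

This exact scan is what gets expensive at scale: every query touches every row, which is why the ANN libraries in [4] exist.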
======
PaulHoule
High-dimensional nearest-neighbor search is a tough problem; there are index
algorithms such as ball trees that work, but they don't deliver the big wins
that B-trees give in 1-D space, quadtrees in 2-D space, etc.

In many "as a service" offerings computational costs are not a big deal. For
this one they would be, so making the pricing work for everybody would be a
toughie.
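The distance-concentration effect behind this is easy to demonstrate: with random points, the gap between the nearest and farthest neighbor shrinks as dimensionality grows, which starves tree indexes of pruning opportunities. A small NumPy sketch on synthetic uniform data (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)

def distance_contrast(dim, n=1000):
    """Ratio of farthest to nearest distance from a random query to n random points."""
    points = rng.random((n, dim))
    query = rng.random(dim)
    d = np.linalg.norm(points - query, axis=1)
    return d.max() / d.min()

print(distance_contrast(2))     # large contrast: near and far points are very different
print(distance_contrast(1000))  # contrast close to 1: everything is roughly equidistant
```

When the contrast approaches 1, a ball tree can prune almost nothing, and the index degrades toward a full scan.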

------
billconan
I thought about word2vec as a service. I gave up because I think customers
could easily cache (pirate) my data.

~~~
kleebeesh
That's not really a problem if the user is frequently ingesting and
vectorizing new data. They need a place to store it and query it efficiently.
They can cache the results for an old vector, but they still need to compute
fresh queries for every new piece of content. Also, old vectors may have
gained relationships to vectors inserted since the cache was populated, so you
don't want to serve stale results.

~~~
billconan
If we forget about making the service for a moment: how would you store and
process high-dimensional vectors locally? What ready-made library/software
would you use? What data structure?

~~~
kleebeesh
Some storage options are:

- Store many vectors in a single HDF5 or LMDB file.
- Store single vectors in many small binary files (e.g. using numpy's save() function).

To look up neighbors you might:

- Compute neighbors exhaustively (e.g. using SciPy distance functions).
- Use an approximate nearest-neighbor library, like one of those benchmarked here: https://github.com/erikbern/ann-benchmarks. These can be much faster than exhaustive search, at the cost of some accuracy and having to periodically rebuild an index.
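As a concrete sketch of the second storage option combined with exhaustive lookup — the file layout, IDs, and dimensions here are invented for illustration:

```python
import tempfile
from pathlib import Path
import numpy as np

rng = np.random.default_rng(0)
store = Path(tempfile.mkdtemp())

# Write each vector to its own .npy file, named by item ID.
for item_id in range(100):
    np.save(store / f"{item_id}.npy", rng.normal(size=32).astype(np.float32))

# Load everything back and search exhaustively by Euclidean distance.
ids, vecs = zip(*((int(p.stem), np.load(p)) for p in store.glob("*.npy")))
matrix = np.stack(vecs)

query = matrix[ids.index(7)]                    # look up item 7's vector
dists = np.linalg.norm(matrix - query, axis=1)  # distance to every stored vector
top = [ids[i] for i in np.argsort(dists)[:5]]
print(top)  # item 7 itself comes first, at distance 0
```

The many-small-files layout makes inserts trivial, but every query re-reads the whole store; the single-file (HDF5/LMDB) option or an ANN index trades insert simplicity for query speed.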

