
How many dimensions can you use this with efficiently? Could you use it to store embeddings from machine learning models and find nearest neighbours of various items, the way Annoy or Faiss do?



Postgres has an extension called cube (https://www.postgresql.org/docs/9.6/cube.html) that can be used for up to 100 dimensions (that's a compile-time limit; you can raise it if you compile Postgres yourself).

It's a pretty cool extension: it computes distances between points and intersections between n-dimensional cubes (hence the name), supports different distance metrics, etc.

It'd be perfect for storing and searching through large numbers of n-dimensional embeddings; I'm guessing it's already used for that.
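
To make that concrete, here's a minimal sketch of a cube KNN query from Python with psycopg2. The table, column, and connection string are made up for illustration; only the cube functions and operators come from the extension itself.

    import psycopg2

    conn = psycopg2.connect("dbname=mydb")  # hypothetical connection string
    cur = conn.cursor()

    # One-time setup: enable the extension, store each embedding as a cube
    # value, and add a GiST index so ORDER BY <-> can do an index-assisted
    # nearest-neighbour scan.
    cur.execute("CREATE EXTENSION IF NOT EXISTS cube")
    cur.execute("CREATE TABLE IF NOT EXISTS items (id bigint PRIMARY KEY, embedding cube)")
    cur.execute("CREATE INDEX IF NOT EXISTS items_embedding_idx ON items USING gist (embedding)")

    # 10 nearest neighbours by Euclidean distance (<->); cube also provides
    # taxicab (<#>) and Chebyshev (<=>) distance operators.
    query_vec = [0.1, 0.2, 0.3]  # toy 3-D point; real embeddings are wider
    cur.execute(
        "SELECT id, embedding <-> cube(%s::float8[]) AS dist "
        "FROM items "
        "ORDER BY embedding <-> cube(%s::float8[]) "
        "LIMIT 10",
        (query_vec, query_vec),
    )
    print(cur.fetchall())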


In my experience, the cube extension is unusable for >10M x 128D vectors without PCA. I'm using Faiss now with ~500M vectors, and it works great!


How many dimensions are you using with Faiss at 100M+ vectors? I'm currently looking for a solution to handle 1024 dimensions for ~100M items.


On one index I'm using OPQ16_64,IVF262144_HNSW32,PQ16 with 128 dimensions initially.

1024 dimensions is a lot! Could you elaborate on what application requires that many? If it's a DNN layer output, your data must be sparse, so dimensionality reduction won't affect your recall if tuned properly.
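
For anyone who wants to try that spec, here's a rough sketch with the Faiss Python API on synthetic data. The number of inverted lists is scaled down from 262144 to 1024 so it trains on a toy sample; the full-size variant needs millions of training vectors.

    import numpy as np
    import faiss

    d = 128
    # Scaled-down version of the factory string above.
    index = faiss.index_factory(d, "OPQ16_64,IVF1024_HNSW32,PQ16")

    rng = np.random.default_rng(0)
    train = rng.standard_normal((100_000, d)).astype("float32")
    base = rng.standard_normal((500_000, d)).astype("float32")
    queries = rng.standard_normal((5, d)).astype("float32")

    index.train(train)  # trains the OPQ transform, coarse quantizer and PQ
    index.add(base)

    # The IVF index sits behind an OPQ pre-transform, so reach inside it to
    # set how many inverted lists are scanned per query.
    faiss.extract_index_ivf(index).nprobe = 16

    distances, ids = index.search(queries, 10)
    print(ids)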


It's actually a DNN layer output. I haven't considered dimensionality reduction yet. Thanks for pointing me there, I'll look into it. Probably that's the better way to go.

Thanks a lot for your reply!
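
If it helps, the reduction can be done inside Faiss itself. A minimal sketch with faiss.PCAMatrix on synthetic data; the 1024 -> 256 output size is an arbitrary choice for illustration.

    import numpy as np
    import faiss

    d_in, d_out = 1024, 256
    rng = np.random.default_rng(0)
    train = rng.standard_normal((50_000, d_in)).astype("float32")

    # Learn a PCA projection on a training sample, then apply it to every
    # vector before indexing (and to each query at search time).
    pca = faiss.PCAMatrix(d_in, d_out)
    pca.train(train)
    reduced = pca.apply_py(train)

    index = faiss.IndexFlatL2(d_out)
    index.add(reduced)
    print(index.ntotal)

The same reduction can also be folded into an index_factory string as a PCA prefix (e.g. "PCA256,..."), so it trains together with the rest of the index.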


Fun fact, there’s a bug in the implementation where you can create cubes with much higher dimensions because one constructor doesn’t do the check.

Still wouldn’t recommend it though.


I had the same initial thought based on the title. Unfortunately, the answer is no.

The article discusses a low-dimensional KNN problem. The curse of dimensionality suggests that the methods here likely won't carry over to extremely high-dimensional problems.

Faiss actually comes with a lot of excellent documentation that describes the problems unique to KNN on embedding vectors. In particular, for extremely large datasets, most of the tractable methods are approximations that use clustering, quantization, and centroid-difference tricks to make the computation efficient.

See https://github.com/facebookresearch/faiss/wiki/Faiss-indexes and related links for more information.
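
To make the clustering idea concrete, here's a toy IVFFlat sketch on synthetic data (parameters are illustrative): the vectors are grouped into nlist clusters at train time, and each query only scans the nprobe nearest clusters instead of the whole dataset.

    import numpy as np
    import faiss

    d, nlist = 64, 256
    rng = np.random.default_rng(0)
    base = rng.standard_normal((100_000, d)).astype("float32")
    queries = rng.standard_normal((5, d)).astype("float32")

    # The coarse quantizer holds the cluster centroids; the IVF index puts
    # every vector into the inverted list of its nearest centroid.
    quantizer = faiss.IndexFlatL2(d)
    index = faiss.IndexIVFFlat(quantizer, d, nlist)
    index.train(base)
    index.add(base)

    index.nprobe = 8  # scan only the 8 nearest clusters per query
    distances, ids = index.search(queries, 10)
    print(ids)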


PostGIS's point structure supports 4 dimensions, but I'm not sure whether the Z and M components are part of the distance calculation.

https://postgis.net/docs/manual-3.0/ST_MakePoint.html

Edit: the distance operations are 2D by default, and it looks like the only higher-dimensional variants are 3D.

https://postgis.net/docs/manual-3.0/ST_3DDistance.html
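
A quick sketch of the difference from Python (psycopg2; the connection string is hypothetical): ST_Distance ignores Z entirely, ST_3DDistance includes it, and neither uses M.

    import psycopg2

    conn = psycopg2.connect("dbname=mydb")  # hypothetical connection string
    cur = conn.cursor()

    # Two points that differ only in Z: the 2D distance is 0, the 3D distance is 5.
    cur.execute(
        "SELECT ST_Distance(a, b), ST_3DDistance(a, b) "
        "FROM (SELECT ST_MakePoint(0, 0, 0) AS a, ST_MakePoint(0, 0, 5) AS b) AS pts"
    )
    print(cur.fetchone())  # (0.0, 5.0)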


This is a special case of reading the internal GiST spatial index used in PostGIS via the implementation code for the '<->' operator, so no joy for N-dimensional search.

You can use a Python library inside PostgreSQL via PL/Python, but supplying the coordinates to the evaluation is not going to be as compact and specialized as this.


Super curious about this as well.



