
Using the linear distance operator in Postgres 12 to find the closest match - chmaynard
https://www.2ndquadrant.com/en/blog/postgresql-12-implementing-k-nearest-neighbor-space-partitioned-generalized-search-tree-indexes/
======
miles-po
While the unindexed time is impressive given the lack of an index, the GIST
and SP-GIST times are not 0.939 sec and 0.358 sec. All times are in
_milliseconds_: GIST returns in slightly less than 1 ms, not 1 s, and SP-GIST
in slightly more than a third of a millisecond.

Performance for GIST and SP-GIST are in fact three orders of magnitude better
than what the author summarized!
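
For context, the kind of query being benchmarked looks roughly like this (a sketch with a hypothetical `points` table; the article's actual schema and coordinates may differ):

```sql
-- A point table with an SP-GiST index (a GiST variant would say "USING gist")
CREATE TABLE points (p point);
CREATE INDEX points_spgist_idx ON points USING spgist (p);

-- K-nearest-neighbor search: <-> is the distance operator, and
-- ORDER BY ... <-> ... LIMIT k can be served directly from the index
SELECT p, p <-> point '(123.0, 456.0)' AS distance
FROM points
ORDER BY p <-> point '(123.0, 456.0)'
LIMIT 10;
```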

------
zRedShift
So it's using a K-D-B-Tree under the hood? SP-GiST, that is.

[https://en.wikipedia.org/wiki/K-D-B-tree](https://en.wikipedia.org/wiki/K-D-B-tree)

------
dtjohnnyb
How many dimensions can you use this with efficiently? Could you use this to
store embeddings from machine learning models and find nearest neighbours of
various items, like aNNoy or faiss does?

~~~
changelink
Postgres has an extension called cube
([https://www.postgresql.org/docs/9.6/cube.html](https://www.postgresql.org/docs/9.6/cube.html))
that can be used for up to 100 dimensions (a compile-time limit; you can raise
it if you compile Postgres yourself).

It's a pretty cool extension that does distance between points, intersections
between n-dimensional cubes (hence the name), different distance metrics etc.

It'd be perfect for storing and searching through large amounts of
n-dimensional embeddings, I'm guessing it's used for that already.
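
As an illustration, a hedged sketch of what cube usage looks like (table and values are hypothetical):

```sql
CREATE EXTENSION cube;

CREATE TABLE embeddings (id serial PRIMARY KEY, v cube);
INSERT INTO embeddings (v) VALUES (cube(array[0.1, 0.2, 0.3]));

-- <-> is Euclidean distance; <#> is taxicab and <=> is Chebyshev
SELECT id, v <-> cube(array[0.0, 0.0, 0.0]) AS dist
FROM embeddings
ORDER BY v <-> cube(array[0.0, 0.0, 0.0])
LIMIT 5;

-- && tests whether two n-dimensional cubes overlap
SELECT cube(array[0, 0], array[1, 1]) && cube(array[0.5, 0.5], array[2, 2]);
```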

~~~
kaivi
In my experience, the cube extension is unusable for >10M x 128D vectors
without PCA. I'm using Faiss now with ~500M vectors, and it works great!

~~~
phelmig
How many dimensions are you using Faiss with at 100m+ vectors? I'm currently
looking for a solution to handle 1024 dimensions for ~100m items.

~~~
kaivi
On one index I'm using OPQ16_64,IVF262144_HNSW32,PQ16 with 128 dimensions
initially.

1024 dimensions is a lot! Could you elaborate on what application requires
that many? If it's a DNN layer output, your data must be sparse, so
dimensionality reduction won't affect your recall if tuned properly.

~~~
phelmig
It's actually a DNN layer output. I haven't considered dimensionality
reduction yet. Thanks for pointing me there, I'll look into it. That's
probably the better way to go.

Thanks a lot for your reply!

------
jonatron
I find it strange that Postgres has features like these, but lacks bit_count,
which is required for hamming distance. It's easy to write a C-Language User
Defined Function, but then you can't use RDS.
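
For what it's worth, a common pure-SQL workaround (no C function, so it works on RDS too) is to XOR the bit strings and count the '1' characters; newer releases (14+) do have a built-in bit_count:

```sql
-- Hamming distance between two bit strings: XOR, render as text, count the 1s
SELECT length(replace((a # b)::text, '0', '')) AS hamming_distance
FROM (VALUES (B'10110'::bit(5), B'11100'::bit(5))) AS t(a, b);
-- 10110 XOR 11100 = 01010, so the result here is 2
```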

~~~
takeda
One of the greatest things about Postgres is its extension system, which is
meant for exactly things like this. RDS not supporting custom extensions is an
RDS problem, and frankly I wouldn't recommend using RDS. You lose a lot of
functionality, upgrades are tougher, etc. Postgres maintenance effort is quite
low; the hardest part is probably running it, and RDS doesn't save you from
that.

Anyway, this feature being discussed (searching based on a distance) is part
of the PostGIS extension; what they are demonstrating here is the SP-GiST
index.

~~~
Znafon
> RDS not supporting custom extension is RDS problem and frankly I wouldn't
> recommend using RDS.

We are using RDS and would like to have a standard PostgreSQL instead. Are
there any community-supported AMIs that could help us get started quickly?

~~~
vosper
> _We are using RDS and would like to have a standard PostgreSQL instead._

Having not used Postgres RDS, what about it makes you want to have a standard
Postgres instead?

~~~
Znafon
I think RDS has some very strong pros:

\- not-so-bad pricing

\- good defaults

\- no need to worry about backups, and easy disaster recovery

\- some monitoring out of the box

\- TLS with a valid cert supported out of the box

\- easy replication

I'm not an ops person, so not having to worry about the operational aspects of
our datastore was great.

Here are some pain-points we have, I'm probably forgetting some but those were
really painful:

\- You don't have superadmin access. We had several issues where some tables,
sequences or other objects were associated with a user we wanted to delete,
and this was really hard; I think we were supposed to change the owner of each
object one by one. [0] shows the issue. In the end, we solved it by dumping
the database, deleting the user, and recreating the database. Thankfully, this
wasn't in prod.

\- Many extensions cannot be installed, for example Timescale. It's a bit sad
having PostgreSQL and not being able to take full advantage of it.

What I regret about both of these is that there is nothing you can do. For the
first point, I looked everywhere and many people have this issue, but there is
no easy fix; there are posts on both StackOverflow and the PostgreSQL mailing
list, but no easy solution: you have to track down, one by one, the objects
whose owner you want to delete and reassign them. For the second, it means you
can't use many useful extensions like Timescale, or anything not yet released.

If there was some AMI with PostgreSQL already installed and correctly
configured on which we could just connect an EBS, we could migrate very
easily.

[0] [https://stackoverflow.com/questions/44718654/cant-delete-pos...](https://stackoverflow.com/questions/44718654/cant-delete-postgresql-user-who-poses-a-security-risk/44718983#44718983)
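
For anyone hitting the same owner problem on a self-managed Postgres: with real superuser access, the usual sequence (run in each database the role touches) is something like:

```sql
-- Hand everything the role owns to another role, clear its grants, then drop it
REASSIGN OWNED BY doomed_user TO postgres;
DROP OWNED BY doomed_user;   -- revokes privileges and drops any remaining owned objects
DROP USER doomed_user;
```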

------
copperx
Is there any way to get the closest _driving_ distance in Postgres without
relying on some API?

I haven't done anything like this, but is it possible to use the data from
OpenStreetMap locally along with Postgres to, e.g., sort results by shortest
driving distance?

Maybe there's an extension? I'm just curious.

~~~
shabble
You might want pgrouting[1], if you can find access to appropriate datasets
for your routing (road) network.

[1]
[http://workshop.pgrouting.org/2.5/en/chapters/advanced.html#...](http://workshop.pgrouting.org/2.5/en/chapters/advanced.html#routing-for-vehicles)

~~~
copperx
That's really great, thanks.

------
Sniffnoy
I'm a little confused as to what this does. Is this just Euclidean distance,
or is it meant to be used for longitude/latitude? The article doesn't make it
clear.

~~~
mistrial9
this is Euclidean distance, in units of degrees .. not so great, except that
the answer is expected to be the "closest" point, so the distances involved
tend to be small and need less precision .. not useful for measurement in
meters
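
If you need meters, the earthdistance extension (built on cube) computes great-circle distance from lat/lon; a quick sketch with example coordinates:

```sql
CREATE EXTENSION cube;           -- earthdistance depends on cube
CREATE EXTENSION earthdistance;

-- Great-circle distance in meters; ll_to_earth takes (latitude, longitude)
SELECT earth_distance(
  ll_to_earth(40.7128, -74.0060),  -- New York
  ll_to_earth(51.5074, -0.1278)    -- London
) AS meters;
```

PostGIS geography types are the heavier-duty option if you need proper geodetic distances plus indexing.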

------
mlthoughts2018
If there is ever going to be a system capable of supporting the creation of
something like The Matrix, it's going to be Postgres. It tries to eat all
application logic: the suggested move is to put it in the database and use
language extensions.

I’ve only ever found it to be a false promise. You get locked inside the db
and can’t create application logic outside the db, and the internal language
extensions are super poor solutions for this if you need multiple
environments, multiple packages, different versions of the same extension-
based app.

I love Postgres as a db, but I feel like all the viral marketing for moving to
internal db app logic, especially with extensions, is a big trap, and the
tutorials & blog posts give you some really superficial use cases to motivate
it, and in reality you’re speeding toward a cliff.

~~~
CuriouslyC
If you have a problem that is never going to be google scale, and you have
people who are familiar with postgres, you can deliver applications that are a
fraction of the code with superior performance.

The only problem with db-centric applications is the tooling, which is
inferior to standard application tooling at this point, in terms of version
control and testability. Given the hype train building around postgres as a
platform, I expect this is going to change pretty quickly.

~~~
mlthoughts2018
Disagree. For example, putting machine learning in the database is miserable.
The same is true for other X besides ML. Does not matter what the scale is.

~~~
lootsauce
We have found in-database ML to be great for productionizing basic
machine-learning methods. We use BigQuery ML, and it saved us a ton of fussing
with various frameworks; because we were already in BQ, it also saved us
moving data from the database to an ML cluster and back into the database. We
simply author some SQL and get great results quickly.
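
For readers unfamiliar with it, BigQuery ML training and prediction happen entirely in SQL; a sketch with hypothetical dataset, table, and column names:

```sql
-- Train a logistic regression on an existing table
CREATE OR REPLACE MODEL mydataset.churn_model
OPTIONS (model_type = 'logistic_reg', input_label_cols = ['churned']) AS
SELECT churned, plan, tenure_months, monthly_spend
FROM mydataset.customers;

-- Score new rows without moving data out of the warehouse
SELECT *
FROM ML.PREDICT(
  MODEL mydataset.churn_model,
  (SELECT plan, tenure_months, monthly_spend FROM mydataset.new_customers)
);
```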

~~~
mlthoughts2018
That strikes me as bad usage of ML algorithms for quite a few reasons. People
who say this kind of thing usually are not ML practitioners and want to drink
some Kool Aid that they can save money & avoid hiring them with techniques
like this.

How are you benchmarking the in-DB models against variations that can improve
them? If they are basic regression models, how are you solving nuanced
problems relating to coefficient estimates that cannot be solved with plug and
chug libraries (such as from the article “Let’s Put the Garbage Can
Regressions and Garbage Can Probits Where They Belong” by Achen)?

I’d venture to guess you’re assuming one particular type of model just works,
and either are unaware it is failing, or are unaware of wasted money left on
the table by not improving or understanding it.

This is extremely common especially with simpler use cases. For example, you
might think A/B testing is easily solved as an abstract problem and you just
need a framework. What could be hard? I’d argue that A/B testing is a great
example of something nobody should be buying from a vendor or getting from
unexamined library output, and it’s extremely easy to be losing a bunch of
money on junk tests and not even understand your tests are junk.

This is even more true with OLS and logistic regression, decision trees, and
SVMs. You need customized systems operated by people who actually know
statistics. Embedding this inside a DB, where you cannot seriously develop
software to customize the models and analysis, just doesn't work. The cost of
serializing the data elsewhere is not the salient one.

~~~
CuriouslyC
I've never heard of anyone developing IN the database; rather, it's a place
where you deploy code, so you can do fun things like spectral_cluster((select
...)) or gp_predict_value((select ...)) in the db, avoiding needless
serialization and transfer.

