Hacker News new | past | comments | ask | show | jobs | submit login
Using the linear distance operator in Postgres 12 to find the closest match (2ndquadrant.com)
113 points by chmaynard 8 days ago | hide | past | web | favorite | 39 comments

While the unindexed time is impressive for the lack of index, the GIST and SP-GIST times are not 0.939 sec and 0.358 sec. All times are in milliseconds. GIST returns in slightly less than 1ms, not 1s. SP-GIST returns in slightly more than 1/3 of a millisecond.

Performance for GIST and SP-GIST are in fact three orders of magnitude better than what the author summarized!

So it's using a K-D-B-Tree under the hood? SP-GiST, that is.


How many dimensions can you use this with efficiently? Could you use this to store embeddings from machine learning models and find nearest neighbours of various items, like aNNoy or faiss does?

Postgres has a extension called cube (https://www.postgresql.org/docs/9.6/cube.html) that can be used for up to 150 dimensions (which is a compile-time limit, you can have more if you compile Postgres yourself).

It's a pretty cool extension that does distance between points, intersections between n-dimensional cubes (hence the name), different distance metrics etc.

It'd be perfect for storing and searching through large amounts of n-dimensional embeddings, I'm guessing it's used for that already.

In my experience, the cube extension is unusable for >10M x 128D vectors without PCA. I'm using Faiss now with ~500M vectors, and it works great!

With how many dimensions are you using Faiss with 100m+ vectors? I’m currently looking a solution to handle 1024 dimensions for ~100m items.

On one index I'm using OPQ16_64,IVF262144_HNSW32,PQ16 with 128 dimensions initially.

1024 dimensions is a lot! Could you elaborate on what application requires that many? If it's a DNN layer output, your data must be sparse, so dimensionality reduction won't affect your recall if tuned properly.

It's actually a DNN layer output. I haven't considered dimensionality reduction, yet. Thanks for pointing my there, I'll look into it. Probably thats the better way to go.

Thanks a lot for your reply!

Fun fact, there’s a bug in the implementation where you can create cubes with much higher dimensions because one constructor doesn’t do the check.

Still wouldn’t recommend it though.

I had the same initial thought based on the title. Unfortunately, the answer is no.

The article discusses a low-dimensional KNN problem. The curse of dimensionality guides intuition that the methods here likely will not apply to extremely high-dimensional problems.

faiss actually comes with a lot of excellent documentation that describes the problems unique to KNN on embedding vectors. In particular, for extremely large datasets, most of the tractable methods are approximations that make use of clustering, quantization, and centriod-difference tricks to make computation efficient.

See https://github.com/facebookresearch/faiss/wiki/Faiss-indexes and related links for more information.

Its point structure supports 4 dimensions, but I'm not sure if the Z and M components are part of the distance calculation.


Edit: The distance operations are 2D by default and it looks like there are only 3D variants.


this is a special case of reading the internal GIST spatial index used in PostGIS by the implentation code for operator '<->' , so no joy for N-dimensional search..

You can use a python library inside PostgreSQL using plPython, but supplying the coordinates to the evaluation is not going to be as compact and specialized as this

Super curious about this as well.

I find it strange that Postgres has features like these, but lacks bit_count, which is required for hamming distance. It's easy to write a C-Language User Defined Function, but then you can't use RDS.

One of greatest things about Postgres is its extension system which is meant for things like this. RDS not supporting custom extension is RDS problem and frankly I wouldn't recommend using RDS. You are losing a lot of functionality, upgrades are tougher etc. Postgres maintenance is quite low the hardest part is probably running and RDS doesn't save you from that.

Anyway this feature discussed (searching based on a distance) is part of PostGIS extension they are demonstrating here SP-GiST index.

> RDS not supporting custom extension is RDS problem and frankly I wouldn't recommend using RDS.

We are using RDS and would like to have a standard PostgreSQL instead. Is there some community supported AMIs that could help us to get started quickly?

FWIW, if you have to ask this question then perhaps the ~20-30% cost overhead of RDS is worth not having to handle the sysops layer? If you want to, Bitnami and others have AMIs on the marketplace that have various versions of PG installed: https://aws.amazon.com/marketplace/pp/B01LWIX1ZK?qid=1573152...

> We are using RDS and would like to have a standard PostgreSQL instead.

Having not used Postgres RDS, what about it makes you want to have a standard Postgres instead?

I think RDS has some very strong pros:

  - not so bad pricing
  - good defaults
  - no need to worry regarding the backups and easy disaster recovery
  - some monitoring out of the box
  - support TLS with a valid cert out of the box
  - easy replication
I'm not an ops so not having to worry about the operational aspects of our datastore was great.

Here are some pain-points we have, I'm probably forgetting some but those were really painful:

- You don't have a superadmin access, we had several issues where some tables, sequences or objects where associated with a user we wanted to delete and this was really hard, I think we were supposed to change the owner of each objects one by one. [0] shows the issue. In the end, we solved it by dumping the database, deleting the user, recreating the database. Thankfully, this wasn't in prod.

- Many extensions cannot be installed, for example Timescale. It's a bit sad having PostgreSQL and not being able to take full advantage of it.

What I regret about those is that there is nothing you can do, for the first point I looked everywhere and many have this issue but there is nothing you can do to solve it easily, there is posts both on StackOverflow and PostgreSQL mailing list but no easy solution, you have to track one by one the object whose owner you want to delete and reassign them. For the second, it means you can't use many useful extensions like Timescale or anything not yet released.

If there was some AMI with PostgreSQL already installed and correctly configured on which we could just connect an EBS, we could migrate very easily.

[0] https://stackoverflow.com/questions/44718654/cant-delete-pos...

Select the official AMI for you preferred linux distro and use the package manager to install postgresql?

It's not really about specific AMI, you can install postgresql on any system, and if you use one of these[1] you can enjoy official packages.

One of nice things about RDS is seamless replication and backups, this is not available out of the box, and you kind of need to set it up yourself, it is not that hard, especially if you have a CMS (such as salt, chef) helps a lot with it.

There is already RDS functionality available through open source, if you're on premises I would highly recommend repmgr[2] and barman[3] the former is making setting replication and failover easy, the later is performing streaming backups and PITR.

In AWS in place of barman, you could perhaps use WAL-E[4] or WAL-G[5] these basically do the same thing barman does, but instead of running an instance that keeps the backup WAL-E and WAL-G stores backups directly in S3.

There's also Patroni[6] which is an alternative to repmgr. I have used personally repmgr and barman, and there wasn't really much overhead maintaining it. Although in public cloud the other solutions might be more suited?

The nice thing for maintaining your own setup is greater control over the database, for example you have full control over what extensions you use or which version of PG you deploy. Many changes that in RDS require server restart, you can just restart the process and have lower downtime. In my company we also went with Aurora PG 9.6 there's already Aurora PG 10, but there's currently no upgrade path without taking everything down, backing up and restoring on another instance. Supposedly AWS is working on it, but they won't tell when it is available. Also there's no PG 11 or heh, 12. With own PG setup also major version upgrade is quite quick, I think it took 2 minutes, if you were sure you have backups and performed upgrade in place.

Honestly what RDS offers isn't something that isn't available outside, it's just that might require some prior investment, in our case was writing salt roles that set up and configured the nodes.

[1] https://www.postgresql.org/download/

[2] https://repmgr.org/

[3] https://www.pgbarman.org/index.html

[4] https://github.com/wal-e/wal-e

[5] https://github.com/wal-g/wal-g

[6] https://github.com/zalando/patroni

Thanks, this was exactly what I was looking for. I'll start experimenting with it :)

How are upgrades tough with RDS? It's always just been switching a configuration setting, applying, then wait.

Yeah, Upgrades are far simpler on RDS than HW machines, or especially Docker images. I did an upgrade of a 1.5TB database from 9.6-> 11.x on RDS on a severely undersized DB machine in about 15 minutes. Pointy clicky and done.

Docker images are their own special version of hell, and I've yet to see one with all of the appropriate bits in place to do a migration without a dump+reload.

Personally, If you're doing the AWS thing and you have a PG design that will do RDS, I'd use that and don't look back. If you've got untrusted languages or other tweaks, you're kind of stuck on your own.

Is there any way to get the closest driving distance in Postgres without relying on some API?

I haven't done anything like this, bit it is possible to use the data from OpenStreetMap locally along with Postgres to e.g. sort results by shortest driving distance?

Maybe there's an extension? I'm just curious.

You might want pgrouting[1], if you can find access to appropriate datasets for your routing (road) network.

[1] http://workshop.pgrouting.org/2.5/en/chapters/advanced.html#...

That's really great, thanks.

I'm a little confused as to what this does. Is this just Euclidean distance, or is it meant to be used for longitude/latitude? The article doesn't make it clear.

this is distance in units of degree 'Euclidean' .. not so great, except that the answer is expected to be the "closest" point, which tends to be smaller and need less precision.. not useful for measurement in meters

If there is going to be any system capable of supporting creating something like The Matrix, it’s going to be Postgres. Tries to eat all application logic by suggested move in database, use language extensions.

I’ve only ever found it to be a false promise. You get locked inside the db and can’t create application logic outside the db, and the internal language extensions are super poor solutions for this if you need multiple environments, multiple packages, different versions of the same extension-based app.

I love Postgres as a db, but I feel like all the viral marketing for moving to internal db app logic, especially with extensions, is a big trap, and the tutorials & blog posts give you some really superficial use cases to motivate it, and in reality you’re speeding toward a cliff.

If you have a problem that is never going to be google scale, and you have people who are familiar with postgres, you can deliver applications that are a fraction of the code with superior performance.

The only problem with db-centric applications is the tooling, which is inferior to standard application tooling at this point, in terms of version control and testability. Given the hype train building around postgres as a platform, I expect this is going to change pretty quickly.

Disagree. For example, putting machine learning in the database is miserable. The same is true for other X besides ML. Does not matter what the scale is.

We have found in database ML to be great to productionalize basic machine-learning methods. We use BigQuery ML and it saved us a ton of fussing with various frameworks, and because we were already in BQ it saved us the moving of data from database to ML cluster and back into database. We simply author some SQL and get great results quickly.

That strikes me as bad usage of ML algorithms for quite a few reasons. People who say this kind of thing usually are not ML practitioners and want to drink some Kool Aid that they can save money & avoid hiring them with techniques like this.

How are you benchmarking the in-DB models against variations that can improve them? If they are basic regression models, how are you solving nuanced problems relating to coefficient estimates that cannot be solved with plug and chug libraries (such as from the article “Let’s Put the Garbage Can Regressions and Garbage Can Probits Where They Belong” by Achen)?

I’d venture to guess you’re assuming one particular type of model just works, and either are unaware it is failing, or are unaware of wasted money left on the table by not improving or understanding it.

This is extremely common especially with simpler use cases. For example, you might think A/B testing is easily solved as an abstract problem and you just need a framework. What could be hard? I’d argue that A/B testing is a great example of something nobody should be buying from a vendor or getting from unexamined library output, and it’s extremely easy to be losing a bunch of money on junk tests and not even understand your tests are junk.

It only becomes much more true with OLS & logistic regression, decision trees, SVMs. You need customized systems operated by people who actually know statistics. Embedding this inside a DB where you cannot seriously develop software to customize the models & analysis just doesn’t work. The cost of serializing the data elsewhere is not the salient one.

I've never heard of anyone developing IN the database, rather it's a place where you deploy code, so you can do fun things like spectral_cluster((select ...)) or gp_predict_value((select ...)) in the db, avoiding needless serialization and transfer.

Even if you have google-scale problems... Google, after all, has built a google-scale SQL.

I'd say the real value of the 'application-like' capabilities of Postgres is for analytics and offline tasks where there may not even be an application to 'eat'. This article is a great example, anyone can get this working on their laptop easily and just feed new data into the structure they have created and trust that it will just work.

Now, if this is powering some actual business logic then that's can cause problems since now your application is tightly coupled to a specific implementation in a specific database technology that may or may not be testable, versioned, or observable. But if you're a data hacker trying to answer questions, produce a report, or tease out features for a ML model having a such a robust easy to manage tool in your toolbelt is super useful.

Application Postgres and Analytics Postgres are really different animals, it's pretty cool that one piece of software can do both so well and be so many different things to different people, but yeah I agree you can into real trouble if you use the wrong part of the tool for the job.

Yes, almost every software or business presentation i see these days is repeating this sentiment: "Postgres is eating the world!" and "Every company is a Postgres company!"

Is that a bad thing?

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact