Performance for GiST and SP-GiST is in fact three orders of magnitude better than what the author summarized!
It's a pretty cool extension that does distances between points, intersections of n-dimensional cubes (hence the name), different distance metrics, etc.
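For anyone curious, here's a minimal sketch of what that looks like with the cube extension (the table and column names are made up; also note cube is compiled with a default cap of 100 dimensions):

```sql
CREATE EXTENSION IF NOT EXISTS cube;

-- Hypothetical table of 3-D embeddings stored as cubes
CREATE TABLE items (
    id        serial PRIMARY KEY,
    embedding cube
);

-- A GiST index lets the nearest-neighbour ORDER BY use the index
CREATE INDEX items_embedding_idx ON items USING gist (embedding);

-- 10 nearest neighbours by Euclidean distance (<->);
-- cube also offers taxicab (<#>) and Chebyshev (<=>) distances
SELECT id
FROM items
ORDER BY embedding <-> cube(array[0.1, 0.2, 0.3])
LIMIT 10;
```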
It'd be perfect for storing and searching through large amounts of n-dimensional embeddings; I'm guessing it's used for that already.
1024 dimensions is a lot! Could you elaborate on what application requires that many? If it's a DNN layer output, your data must be sparse, so dimensionality reduction won't affect your recall if tuned properly.
Thanks a lot for your reply!
Still wouldn’t recommend it though.
The article discusses a low-dimensional KNN problem. The curse of dimensionality suggests that the methods here likely will not apply to extremely high-dimensional problems.
faiss actually comes with a lot of excellent documentation describing the problems unique to KNN on embedding vectors. In particular, for extremely large datasets, most of the tractable methods are approximations that use clustering, quantization, and centroid-difference tricks to make computation efficient.
See https://github.com/facebookresearch/faiss/wiki/Faiss-indexes and related links for more information.
Edit: The distance operations are 2D by default and it looks like there are only 3D variants.
You can use a Python library inside PostgreSQL using PL/Python, but supplying the coordinates to the evaluation is not going to be as compact and specialized as this.
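For completeness, a rough sketch of the PL/Python route (the function name and logic are invented for illustration):

```sql
CREATE EXTENSION IF NOT EXISTS plpython3u;

-- Toy example: Euclidean distance computed in Python
CREATE FUNCTION py_distance(a float8[], b float8[])
RETURNS float8
AS $$
    import math
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
$$ LANGUAGE plpython3u;

SELECT py_distance(ARRAY[0, 0], ARRAY[3, 4]);  -- 5
```

Note that plpython3u is an untrusted language, so installing it requires superuser access, which also rules it out on RDS.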
Anyway, the feature discussed (searching based on a distance) is also part of the PostGIS extension; what they are demonstrating here is an SP-GiST index.
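Concretely, the kind of query being demonstrated looks something like this (made-up table; as of PostgreSQL 12 the SP-GiST index can satisfy the distance ORDER BY itself):

```sql
CREATE TABLE places (
    id  serial PRIMARY KEY,
    pos point
);

CREATE INDEX places_pos_idx ON places USING spgist (pos);

-- Nearest 5 places to a given point, ordered by Euclidean distance
SELECT id
FROM places
ORDER BY pos <-> point '(2.5, 3.1)'
LIMIT 5;
```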
We are using RDS and would like to have a standard PostgreSQL instead. Are there some community-supported AMIs that could help us get started quickly?
Having not used Postgres RDS, what about it makes you want to have a standard Postgres instead?
- not-so-bad pricing
- good defaults
- no need to worry about backups, and easy disaster recovery
- some monitoring out of the box
- TLS with a valid cert supported out of the box
- easy replication
Here are some pain points we have; I'm probably forgetting some, but these were really painful:
- You don't have superuser access. We had several issues where some tables, sequences, or other objects were owned by a user we wanted to delete, and this was really hard; I think we were supposed to change the owner of each object one by one. In the end, we solved it by dumping the database, deleting the user, and recreating the database. Thankfully, this wasn't in prod.
- Many extensions cannot be installed, for example Timescale. It's a bit sad having PostgreSQL and not being able to take full advantage of it.
What I regret about these is that there is nothing you can do. For the first point, I looked everywhere; many people have this issue, and there are posts on both StackOverflow and the PostgreSQL mailing list, but no easy solution: you have to track down, one by one, the objects whose owner you want to delete and reassign them. For the second, it means you can't use many useful extensions like Timescale, or anything not yet released.
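For what it's worth, on a standard PostgreSQL where you do hold a real superuser role, the usual recipe is short (role names are placeholders); part of the RDS pain is that, without true superuser rights, these commands tend to fail with permission errors:

```sql
-- Run in every database that contains objects owned by the role
REASSIGN OWNED BY old_user TO new_owner;
DROP OWNED BY old_user;   -- drops the role's remaining privileges/default ACLs
DROP ROLE old_user;
```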
If there were some AMI with PostgreSQL already installed and correctly configured, to which we could just attach an EBS volume, we could migrate very easily.
One of the nice things about RDS is seamless replication and backups; this is not available out of the box elsewhere, and you kind of need to set it up yourself. It's not that hard, especially if you have a configuration-management tool (such as Salt or Chef), which helps a lot with it.
There is already RDS-like functionality available through open source. If you're on premises, I would highly recommend repmgr and barman: the former makes setting up replication and failover easy, the latter performs streaming backups and PITR.
In AWS, in place of barman, you could perhaps use WAL-E or WAL-G. These basically do the same thing barman does, but instead of running an instance that keeps the backups, WAL-E and WAL-G store backups directly in S3.
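As a sketch, wiring WAL-G into WAL archiving is roughly this (the S3 bucket prefix and credentials are assumed to be set in the server's environment, e.g. WALG_S3_PREFIX):

```
# postgresql.conf
archive_mode    = on
archive_command = 'wal-g wal-push %p'

# for restores / PITR
restore_command = 'wal-g wal-fetch %f %p'
```

Base backups are then taken with `wal-g backup-push` on a schedule.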
There's also Patroni, which is an alternative to repmgr. I have personally used repmgr and barman, and there wasn't really much overhead in maintaining them. Although in a public cloud the other solutions might be better suited?
The nice thing about maintaining your own setup is greater control over the database: for example, you have full control over which extensions you use and which version of PG you deploy. For many changes that in RDS require a server restart, you can just restart the process and take less downtime. At my company we also went with Aurora PG 9.6; there's already Aurora PG 10, but there's currently no upgrade path without taking everything down, backing up, and restoring on another instance. Supposedly AWS is working on it, but they won't say when it will be available. Also there's no PG 11 or, heh, 12. With your own PG setup, a major version upgrade is also quite quick; I think it took 2 minutes, provided you were sure you had backups and performed the upgrade in place.
Honestly, what RDS offers isn't something that isn't available outside; it just might require some prior investment. In our case that was writing Salt roles that set up and configured the nodes.
Docker images are their own special version of hell, and I've yet to see one with all of the appropriate bits in place to do a migration without a dump+reload.
Personally, if you're doing the AWS thing and you have a PG design that will work on RDS, I'd use that and not look back. If you've got untrusted languages or other tweaks, you're kind of stuck on your own.
I haven't done anything like this, but is it possible to use the data from OpenStreetMap locally along with Postgres to, e.g., sort results by shortest driving distance?
Maybe there's an extension? I'm just curious.
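There is: pgRouting builds on PostGIS for exactly this, over a road graph imported from OSM (e.g. with osm2pgrouting). A rough sketch, assuming the `ways` table that osm2pgrouting produces and made-up vertex ids:

```sql
-- Shortest driving route cost between two road-network vertices
SELECT sum(cost) AS total_cost
FROM pgr_dijkstra(
    'SELECT gid AS id, source, target, cost_s AS cost FROM ways',
    241,    -- source vertex id (placeholder)
    1087    -- target vertex id (placeholder)
);
```

Sorting query results by driving distance then means running this per candidate destination, which gets expensive; many setups pre-filter candidates by straight-line distance first.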
I’ve only ever found it to be a false promise. You get locked inside the db and can’t create application logic outside it, and the internal language extensions are super poor solutions if you need multiple environments, multiple packages, or different versions of the same extension-based app.
I love Postgres as a db, but I feel like all the viral marketing for moving to internal db app logic, especially with extensions, is a big trap, and the tutorials & blog posts give you some really superficial use cases to motivate it, and in reality you’re speeding toward a cliff.
The only problem with db-centric applications is the tooling, which is inferior to standard application tooling at this point, in terms of version control and testability. Given the hype train building around postgres as a platform, I expect this is going to change pretty quickly.
How are you benchmarking the in-DB models against variations that can improve them? If they are basic regression models, how are you solving nuanced problems relating to coefficient estimates that cannot be solved with plug and chug libraries (such as from the article “Let’s Put the Garbage Can Regressions and Garbage Can Probits Where They Belong” by Achen)?
I’d venture to guess you’re assuming one particular type of model just works, and either are unaware it is failing, or are unaware of wasted money left on the table by not improving or understanding it.
This is extremely common especially with simpler use cases. For example, you might think A/B testing is easily solved as an abstract problem and you just need a framework. What could be hard? I’d argue that A/B testing is a great example of something nobody should be buying from a vendor or getting from unexamined library output, and it’s extremely easy to be losing a bunch of money on junk tests and not even understand your tests are junk.
This only becomes more true with OLS & logistic regression, decision trees, and SVMs. You need customized systems operated by people who actually know statistics. Embedding this inside a DB, where you cannot seriously develop software to customize the models & analysis, just doesn’t work. The cost of serializing the data elsewhere is not the salient one.
Now, if this is powering actual business logic, that can cause problems, since your application becomes tightly coupled to a specific implementation in a specific database technology that may or may not be testable, versioned, or observable. But if you're a data hacker trying to answer questions, produce a report, or tease out features for an ML model, having such a robust, easy-to-manage tool in your toolbelt is super useful.
Application Postgres and Analytics Postgres are really different animals. It's pretty cool that one piece of software can do both so well and be so many different things to different people, but yeah, I agree you can get into real trouble if you use the wrong part of the tool for the job.