Lesser-known Postgres features (hakibenita.com)
702 points by CRConrad 17 days ago | 166 comments



I want to stress the importance of not using

  id SERIAL
If you are on a somewhat recent version of postgres, please do yourself a favor and use:

  id int GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY
An "identity column", the part here:

https://hakibenita.com/postgresql-unknown-features#prevent-s...

You might think this is trivial -- but SERIAL creates an "owned" (by a certain user) sequence behind the scenes, and so you run into massive headaches if you try to move things around.

Identity columns don't, and avoid the issue altogether.
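To make the ownership difference concrete, here is a minimal sketch (table names are illustrative):

```sql
-- SERIAL expands into a separately created sequence, owned by the column:
CREATE TABLE t_serial (id serial PRIMARY KEY);
-- roughly equivalent to:
--   CREATE SEQUENCE t_serial_id_seq;
--   CREATE TABLE t_serial (id int DEFAULT nextval('t_serial_id_seq') PRIMARY KEY);
--   ALTER SEQUENCE t_serial_id_seq OWNED BY t_serial.id;

-- An identity column has no separately managed sequence object to move around;
-- the sequence is an integral part of the column:
CREATE TABLE t_identity (id int GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY);
ALTER TABLE t_identity ALTER COLUMN id RESTART WITH 1000;  -- managed via the column
```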

https://www.2ndquadrant.com/en/blog/postgresql-10-identity-c...


I'll stop short of giving a recommendation or using the word "should", but I'll give encouragement to consider using UUIDs for keys. I have used them in several systems, have never had any issues with them, and they solve so many problems. For one, the ability to generate a key on the client, on the server, or in the database is great. And the fact that keys are unique not only within a table but across the system (or many systems) is a huge advantage.


Yeah I recognize the arguments for UUID keys:

  - Avoids people being able to just iterate through records, or to discover roughly how many records of a thing you have
  - Allows you to generate the key before the row is saved
I think I default to auto-increment ID's due to:

  - Familiarity bias
  - They have a temporal aspect to them (i.e., I know the row with ID 225 was created before the row with ID 392, and approximately when each was created)
  - Easier to read (when you have fewer than ~1,000,000 rows in a table)
I agree and think you're right that UUIDs are probably a better default.

Though you can never find a "definitive" guide/rule online or in the docs unfortunately.


> They have a temporal aspect to them (i.e., I know the row with ID 225 was created before the row with ID 392, and approximately when each was created)

UUIDv7 (currently a draft spec[0]) is a UUID version whose IDs sort in the chronological order they were created.

In the meantime, ulid[1] and ksuid[2] are popular time-sortable ID schemes, both previously discussed on HN[3]

[0] https://datatracker.ietf.org/doc/html/draft-peabody-dispatch...

[1] https://github.com/ulid/spec

[2] https://github.com/segmentio/ksuid

[3] ulid discussion: https://news.ycombinator.com/item?id=18768909

UUIDv7 discussion: https://news.ycombinator.com/item?id=28088213
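Until native support lands, a v7-style value can be composed in SQL itself. This is a sketch of a community-circulated approach (the function name is my own; gen_random_uuid() needs pgcrypto before Postgres 13): it overlays the millisecond timestamp onto the first 48 bits of a random UUID, then fixes up the version bits.

```sql
-- Sketch of a UUIDv7-style generator: 48-bit big-endian ms timestamp,
-- then version/variant bits, rest random. Verify bit positions against
-- the draft spec before relying on this.
CREATE OR REPLACE FUNCTION uuid_generate_v7() RETURNS uuid AS $$
  SELECT encode(
    set_bit(
      set_bit(
        -- overwrite the first 6 bytes with the current time in milliseconds
        overlay(uuid_send(gen_random_uuid())
                PLACING substring(int8send((extract(epoch FROM clock_timestamp()) * 1000)::bigint) FROM 3)
                FROM 1 FOR 6),
        52, 1),  -- flip the v4 version nibble (0100) ...
      53, 1),    -- ... to v7 (0111)
    'hex')::uuid;
$$ LANGUAGE sql VOLATILE;
```

Values generated later sort lexically after earlier ones, at millisecond granularity.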


I reviewed existing UUIDv7 implementations, and many were incorrect or had subtle timing bugs.

We ended up implementing UUIDv7 in our ID generation library https://github.com/MatrixAI/js-id. And we have a number of tests ensuring that it is truly monotonic even across process restarts.

See IdSortable.


Yup, this. ULID is great, and implemented in many languages. For example, there's a crate named ulid that gives you ULIDs in Rust.


Came here to also hype ULID. It's just like an INT, but 128-bit, timestamped, available in every language, and moving data around is trivial because no ID ever collides with anything else. Makes finding crap in my logs easier. (UUID has that feature too) - it's the sorting that does it for me.


Is there a PostgreSQL extension for ulid?


I never found one. I'm really interested in it too, but I believe the PG ecosystem will support UUIDv7 before considering ULID.


Here's one[1], not actively maintained though.

[1] https://github.com/edoceo/pg-ulid


Hmm, that's really useful :) Thanks!


Wow, neat. TIL


Something I always liked but have never done -- start auto-increment IDs at some number other than 1. I worked on a system where account IDs started with 1000000, invoice IDs started with 2000000, etc. That way, if you saw a random number lying around like "1001234", you knew it was an account ID. "2000123", an invoice! I don't remember how often it helped things, but it was more than 0 times, and it always "felt" good. I never loved the implicit upper bound, but it was also never a problem.

(And remember, the values are in different tables, so the only disadvantage is that your millionth user has ID 2000000. To the trained eye they look like an invoice; but there is no way for the system itself to treat that as an invoice, so it's only confusing to humans. If you use auto-increment keys that start at 1, you have the same problem. Account 1 and Invoice 1 are obviously different things.)
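The scheme above can be sketched with identity start values (table names illustrative):

```sql
-- Give each entity its own ID range by starting the counters far apart:
CREATE TABLE accounts (
    id int GENERATED BY DEFAULT AS IDENTITY (START WITH 1000000) PRIMARY KEY
);
CREATE TABLE invoices (
    id int GENERATED BY DEFAULT AS IDENTITY (START WITH 2000000) PRIMARY KEY
);
-- The first rows get ids 1000000 and 2000000 respectively, so a bare
-- number like 1001234 is recognizably an account.
```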


> but there is no way for the system itself to treat that as an invoice, so it's only confusing to humans.

I agree in principle, but then you have Windows skipping version 9 because of all of the `version.hasPrefix("9")` out in the world that were trying to cleverly handle both 95 and 98 at once. If a feature of data is exploitable, there's a strong chance it will be exploited.


I use this for internal items, as in: real customer IDs start at 10,000, so we have 9,999 possible users/invoices/whatever for our own usage in the database. That makes it easy to filter demo accounts of your sales employees, or production test accounts, in queries: just add an offset.

Re: the upper bound: if you reach a million customers, you have lots of other nice problems, like how to spend all the money they earn you :-)
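A sketch of that layout (names are illustrative):

```sql
-- Reserve ids below 10,000 for internal/test accounts; real customers start above.
CREATE TABLE customers (
    id int GENERATED BY DEFAULT AS IDENTITY (START WITH 10000) PRIMARY KEY,
    name text
);

-- Internal accounts are inserted with explicit low ids:
INSERT INTO customers (id, name) VALUES (1, 'sales demo');

-- Excluding them from reports is a single predicate:
SELECT count(*) FROM customers WHERE id >= 10000;
```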


This is clever.

You could use "is_test_account" or whatnot, but that adds an unnecessary column (for ~99% of records)

I like this -- will keep it in mind!


I worked on an existing database once that stored geographic locations and used the signed bit for namespacing.

Positive IDs were home-grown, negative ones were Unique Feature Identifiers from some old GNS system import (from some ancient US Government/Military dataset).

Then when something had to be edited/replaced/created it would possibly change a previously negative ID to positive, fun times.


I’ve seen systems that had to track only two types of things - one started at half the value and went up; the other started just below that and went down. Think 500000+ vs 499999-


I prefer UUIDs as well but one other benefit of an integer type key is in the index. UUIDs, especially v4, make for bloated indexes. If there’s a very specific performance reason I’ll use int, otherwise uuid.


One of the hidden pros is that UUIDs have no gaps visible to a product owner or manager. More than once in my life I have had to convince people that a gap in a sequence caused by a transaction rollback is OK. Several times we ended up "fixing data" in the db manually.


This is what I do as well. Most schema structure is on tables that are not very large, and it's nice especially when dealing with multiple environments to prevent errors, or to produce schemas with some polymorphism or joins between several tables without a chain of id-resolving joins in-between.

There are UUID variants that can work well with indices, which shrinks the case for big-integers yet further, to micro-optimizing cases that are situational.


You can use sequential guids. That's what I do. Works well for not bloating indices.


The person you're replying to didn't even mention those advantages though! The ones they did mention are more significant to me.

Of your disadvantages... if I wanted to know when rows were created, I'd just add a created timestamp column.

But "easier to read" is for real -- it's easy when debugging something to have notes referencing rows 12455 and 98923823 or in some cases even keep them in your head. But UUIDs are right out.

And you definitely don't want to put UUIDs in a user-facing URL -- which is actually how you get around "avoids people being able to just iterate through records", really whether you have UUIDs or numeric pks I think putting internal PK's in URLs or anything else publicly exposed is a bad idea, just keep your pks purely internal. Once you commit to that, the comparison between UUIDs vs numeric as pks changes again.


> I think putting internal PK's in URLs or anything else publicly exposed is a bad idea

Why?

Both as a user and as a dev I love that I can just change the PK to get to a specific post, item, whatever instead of changing the whole link.


I mean, the links are going to be a template either way, so you can change the identifier part to change the item; it's just a question of whether the identifier portion is the internal rdbms PK, or something else.

It's considered undesirable because it's basically an "implementation detail", it's good to let the internal PK change without having to change the public-facing URLs, or sometimes vice versa, when there are business reasons to do so.

And then there are the at least possible security implications (mainly "enumeration attacks") and/or "business intelligence to your competitors" implications too.

But it's true that plenty of sites seem to expose the internal PK too, it's not a complete consensus. Just kind of a principle to separate internal implementation from public interface.

Here's a short 2008 HN discussion on it, in which people have both opinions: https://news.ycombinator.com/item?id=19876901

Here's a more recent and lengthy treatment:

https://czep.net/21/obfuscate.html


Thanks, especially those links are helpful, will read through it :)


Random keys have the positive aspect of not creating any hotspots, and they have the negative aspect of not creating any hotspots.

So if you have concurrent updates which are correlated with row insertion time, random keys can be a win. On the other hand, if your lookups are correlated with row insertion time, then the relevant key index pages are less likely to be hot in memory, and depending on how large the table is, you may have thrashing of index pages (this problem would be worse with MySQL, where rows are stored in the PK index).

(UUID doesn't need to mean random any more though as pointed out below. My commentary is specific to random keys and is actually scar tissue from using MD5 as a lookup into multi-billion row tables.)


I generally use both - a serial for internal RDBMS storage to coerce writes to be in order, and a secondary GUID that is app-specific and generated. The app will never care about the serial, and the RDBMS doesn't care about the GUID other than secondary index look-ups (since they're rarely used for range scans).
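A minimal sketch of that dual-key layout (table name illustrative; gen_random_uuid() is built in since Postgres 13, and available via pgcrypto before that):

```sql
CREATE TABLE events (
    seq bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,       -- write-ordered, internal
    public_id uuid NOT NULL DEFAULT gen_random_uuid() UNIQUE,  -- app-generated or defaulted
    payload jsonb
);

-- The app addresses rows only by public_id; internal joins and
-- range scans use the compact, insert-ordered seq:
SELECT seq, payload FROM events WHERE public_id = $1;
```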


Be wary when using GUIDs for the PK in clustered indices, in databases that support them; see for example https://stackoverflow.com/questions/11938044/what-are-the-be... . Sadly, PostgreSQL doesn't maintain clustering on-the-fly like MS SQL does, so it is a less-used feature there.


UUIDv4 keys have worse performance when inserting into the b-tree for the primary key.


That is true, but it has never been a bottleneck for me. We have a table with multiple hundreds of millions of rows and 20ish UUID indexes (yes, it did get out of hand), and insert time isn't an issue.

Ofc, your app performance requirements might vary, and it is objectively true that UUIDs aren't ideal for btree indexes.


Then you are lucky. I once found an insert performance drop of sometimes 5x once a table reaches 500,000 elements or so, with B-trees and UUIDs in particular. Problem is: UUIDs often scale well on the app side, especially with multiple writers, so the app wants to use them.

Unless you are using a time-sorted UUID, and you only do inserts into the table (never updates), avoid any feature that creates a B-tree on those fields IMO. Given the MVCC architecture of Postgres, time-sorted UUIDs are often not enough if you do a lot of updates, as updates are really just inserts, which again create randomness in the index. I've been on a project where, to avoid a refactor (and given that Postgres usage was convenient), they just decided to remove constraints and anything that creates a B-tree index implicitly or explicitly.

It makes me wish hash indexes could be used to back constraints. In my testing under newer versions of Postgres they often use less memory these days, and they scale a lot better despite less engineering effort having gone into them. In other databases where updates happen in-place so as not to change the order of rows (not Postgres MVCC), a BRIN-like index on a time-ordered UUID would often be fantastic for memory usage. ZHeap seems to have died.

Sadly this is something people should be aware of in advance. Otherwise it will probably bite you later, when you have large write volumes and are therefore least able to make changes to the DB as performance drastically goes down (e.g. under heavy customer traffic). This is amplified because writes don't scale in Postgres/most SQL databases.


Why would you use a btree for them? Wouldn’t a hash index be ideal?


> Why would you use a btree for them?

1. because PRIMARY KEY is its own constraint, and the underlying index is not under you control

2. because PRIMARY KEY further restricts UNIQUE, and as of postgres 14 "only B-tree indexes can be declared unique"
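To illustrate the limitation (table name hypothetical):

```sql
-- A hash index can serve equality lookups on a UUID column:
CREATE TABLE sessions (token uuid NOT NULL);
CREATE INDEX sessions_token_hash ON sessions USING hash (token);

-- ...but it cannot back a PRIMARY KEY or UNIQUE constraint
-- (as of Postgres 14, only B-tree indexes can be declared unique):
-- CREATE UNIQUE INDEX ON sessions USING hash (token);  -- ERROR
```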


Some databases don't give you a choice, and your table data is actually stored in a clustered index sorted by primary key. This means random inserts when using a UUID (without embedded time), and more page splitting.


Use sequential guids.


Technically there are many advantages, but operationally, in systems where an int PK is auto-generated, I find it extremely useful as a pretty good proxy for creation time relative to other records, with built-in indexing that avoids having to index an actual created-datetime column for a lot of less precise lookup purposes. The backend frameworks I use around data access make the security benefits of UUIDs pretty moot.


As someone below suggests, ULID is a good alternative which has a temporal aspect in the high bits so it is still a proxy for relative creation time.


Sometimes you want your keys to be smaller, so that more of your important indexes would fit in RAM, making searches and joins faster.


What will really be nice is when UUID V7 has wide support, because in many ways it is the best of all possible worlds:

1. The UUIDs are generated in increasing order, so none of the b-tree issues others have mentioned with fully random UUIDs.

2. They're true UUIDs, so migrating between DBs is easy, ensuring uniqueness across DBs is easy, etc.

3. They also have the benefit of having a significant amount of randomness, so if you have a bug that doesn't do an appropriate access check somewhere they are more resistant to someone trying to guess the ID from a previous one.


A friend of mine recently recommended CUIDs, which I'd never heard of before. Any thoughts on those? Neither usecuid.org nor https://www.npmjs.com/package/cuid give me much confidence, though (the former is not https, and the latter has not been updated in 2 years... which might mean it's done, or might mean it's abandoned).


>> I'll stop short of giving a recommendation or using the word "should" ...

I believe the word you are looking for is "ought". :)


One huge factor for me: assuming you stick to v4, you can use UUIDs directly as identifiers in public contexts without leaking any business-related info and without having predictable keys.


Do you write logic to handle collisions?


"A collision is possible but the total number of unique keys generated is so large that the possibility of a collision is almost zero. As per Wikipedia, the number of UUIDs generated to have atleast 1 collision is 2.71 quintillion. This is equivalent to generating around 1 billion UUIDs per second for about 85 years."


For clarification, that's the number of UUIDs at which you have a 50% chance of at least one collision.

Wikipedia also adds: "the probability to find a duplicate within 103 trillion version-4 UUIDs is one in a billion."
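That 2.71 quintillion figure is the birthday bound, which can be checked directly:

```sql
-- Birthday bound: drawing n values uniformly from a space of size d gives
-- a ~50% collision chance when n ≈ sqrt(2 * d * ln 2).
-- A v4 UUID has 122 random bits (6 of the 128 are fixed version/variant bits):
SELECT sqrt(2 * pow(2, 122) * ln(2)) AS n_for_50pct_collision;
-- ≈ 2.71e18, i.e. the 2.71 quintillion quoted above
```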


Part of the point of UUIDs is that you don't have to. A collision is unlikely enough that your time would be more wisely spent worrying about the sun exploding, unless you generate an absolutely absurd amount of data.


Unless you are writing a nuclear warhead management system, writing UUID collision handling is a waste of time. The client can retry at the top level if the request failed.


Expecting a collision to happen is probably a waste of time, but it still sounds like a good idea to bound how much damage a collision might cause.


My guess is that in 99% of cases a collision would occur during insert. Most of the time the db will simply not allow a duplicate record to be inserted.


My worries were more about access control. It is sort of fine if a customer experiences data loss because an insert fails and the application doesn't retry; it is less fine if a collision causes a user's documents to be swapped with another user's documents and they end up showing kinky porn at a live press conference.

Sort of the distinction between unspecified behaviour and undefined behaviour in C.


The application shouldn't report the insert as successful if it actually failed. That way, the user doesn't go around thinking the insert succeeded, and there is no data loss (if it didn't succeed it must be retried, by the app or manually).

You need some UX like an error message in a red rectangle or something.


There would be data loss if the insert/update met a collision and either decided to overwrite the record or to report the operation as already completed (the latter is what I imagine git would do in a collision scenario).

the solution might as well just be not to care about this case (no sarcasm).


But if we're talking about a database insert and there's a collision (and the uuid column has an unique constraint - as it should, since it's meant to be a primary key), then the insert will not be successful.

Your application might not care about treating this error, but the DB will report it.


That syntax seems a bit verbose for a column that will most likely appear on every table you ever create. How about:

    id int SIMILAR TO SERIAL BUT WITHOUT PROBLEMS MOVING THINGS AROUND
And the other variant for when you aren’t sure that worked:

    id int SIMILAR TO SERIAL BUT WITH EVEN FEWER PROBLEMS MOVING THINGS AROUND


Complain to the SQL standard authors, not the Postgres developers :)


Another favor you should do yourself is use bigint keys (unless using uuid as suggested elsewhere, or you're very, very sure you'll never chew through 2 billion ids).
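Sketch; the identity syntax is unchanged, only the type differs:

```sql
-- int overflows at 2,147,483,647; bigint gives you 2^63 - 1.
CREATE TABLE clicks (
    id bigint GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY
);
```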


I often move ID generation into the application layer (this also helps avoid things like enumeration attacks), and actually quite a lot of cool Postgres features blur that line a little bit. It's interesting to think about sequences and other computational mechanisms in a DB, and whether they make architecting applications easier or harder. I don't have a strong opinion either way, but I'm interested in HN's opinion.


One often hears the counterargument 'but using DB-specific features makes your application less portable!', to which I like to respond: when was the last time you moved an application from one SQL engine to another? Follow-up question: when was it ever 'easy' if you did?

If you start from the basic premise that the database engine and the application are intertwined and are not loosely coupled, using Postgres-specific features feels much less icky from an architectural point of view. They are essentially part of your application given that you use something like Flyway for migrations and you don't manually run SQL against your production to install functions and triggers and such.


So, I did an easy migration of an analytics app from elasticsearch to ClickHouse: what made it easy was (a) we could replay an event stream to regenerate the database and (b) we had interfaces around the database defined in terms of the questions we needed the database to answer, and not in terms of ES-specific concepts.

But, what makes this sort of design so nice is that you can use DB-specific stuff behind the interface because you’re not trying to write all your queries in the minimally supported subset of SQL or something.


I've given up on believing that there is such a thing as portable SQL. If you use the lowest common denominator, you're likely to have queries that perform suboptimally. Hell, even temp tables aren't portable because of the different syntax between databases. I've worked with tons of queries that would take minutes to run without temp tables while running in milliseconds with them. People may as well take advantage of all the nice things their database offers instead of fearing a situation that is unlikely to happen and which will be a tough change anyway.


Adding on: even when you are trying to make a portable application, you tend to make it work on either X or Y, and you can still end up excluding people from switching their existing data from one to the other.

I've used software that at least tried to be portable, so you could install it with whatever database you had available.


I did not cite reasons of portability for this reason, and generally I agree with you. Also some of the non-portable extensions of Postgres like PostGIS are practically indispensable for certain applications.

A more reasonable argument for application layer ID generation is flexibility. It's likely that I want some specific primary key generation (e.g. by email, or username, or username prepended with role, and so on).


Done it multiple times. Most recently we got an MVP off the ground using only ElasticSearch as a backend, since the primary use case we wanted to validate and demonstrate was discoverability.

As we added more features, the lack of transactions and relational structures started to slow us down, so we dropped in Postgres as a backend, and having application-generated UUIDv4s as primary keys was a big part of making that move fairly painless.


We're currently in the process of moving from MongoDB to CockroachDB. It's not been easy. Certain persistence needs to be duplicated for both of them. Certain queries need to be compared between the two, and we need to make a conscious decision when we pick from which DB.

Having said that, moving id-generation logic to the application because the alternative is less portable is an odd reason.


>>> [DB Sequences]

> One often hears the counterargument 'but using DB-specific features makes your application less portable!'

OK, sorry for the probably stupid question: Isn't it just a matter of, for each sequence, selecting its current value and then creating the new one in the target database to start from there? Should be, if perhaps not easily, still reasonably scriptable... Or what am I missing?


For backends, I typically use ID generation in the application layer (the backend). By ID generation I mean the HiLo generator from Hibernate (or NHibernate). By default the HiLo generator still produces a monotonically increasing number, possibly with gaps. (Either way, as long as you expose some form of monotonically increasing number in your API, enumeration attacks could work. The correct fix is simply not to expose the original id if that is an issue: you could expose some kind of opaque token in the API instead of a literal id.)

The advantage of the HiLo generator is that the ORM knows the generated id up-front and does not need a round-trip to the database to determine it, so it can generate more performant queries. Also, contrary to UUIDs, it does not have the disadvantages regarding indices. The only reason I have ever used UUIDs was a requirement that external clients needed to generate the ids.
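The database side of a HiLo setup is just a sequence with a block-sized increment; a sketch (sequence name illustrative, and the exact hi/lo arithmetic varies by ORM):

```sql
-- The ORM reserves a block of ids with one round-trip, then hands them
-- out locally. With an increment of 50, one nextval() covers 50 inserts.
CREATE SEQUENCE customer_hilo_seq INCREMENT BY 50;

SELECT nextval('customer_hilo_seq');  -- one round-trip per block of 50
```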


>You might think this is trivial -- but SERIAL creates an "owned" (by a certain user) sequence behind the scenes, and so you run into massive headaches if you try to move things around.

Maybe it is because I'm too old, but making ids grow from a sequence is how things 'ought' to be done in the old-school DBA way. Sequences are great: they allow the db to maintain two or more sets of incremental ids, which comes in very handy when you're keeping track of certain invoices that need certain incremental numbers. Exposing that in the CREATE statement of the table brings transparency, instead of some magical blackbox IDENTITY. However, it is totally understandable from a developer's perspective that getting to know sequences is just an unneeded headache. ;-)
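A sketch of the invoice-numbering use case (names illustrative; note that plain sequences still leave gaps on rollback, so truly gapless numbering needs a counter table instead):

```sql
-- Independent counters per document type, decoupled from any PK:
CREATE SEQUENCE invoice_2021_seq;
CREATE SEQUENCE credit_note_2021_seq;

-- Each document type draws from its own counter:
SELECT 'INV-2021-' || nextval('invoice_2021_seq')     AS invoice_number;
SELECT 'CN-2021-'  || nextval('credit_note_2021_seq') AS credit_note_number;
```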


Why? Concrete ownership is a good thing. I prefer more rigid structures and I like designing my tables around that. It was the Wild West when I did database design around MySQL 15ish years ago. Of course I didn’t know nearly as much as I do now back then.


There is still ownership, it is just that the sequence is owned by the table, and not the user that created the table.


From TFA:

    Instead of using GENERATED BY DEFAULT, use GENERATED ALWAYS.
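The difference in behaviour, sketched (table name illustrative):

```sql
CREATE TABLE items (
    id int GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    name text
);

INSERT INTO items (name) VALUES ('ok');            -- fine, id assigned
INSERT INTO items (id, name) VALUES (42, 'nope');  -- ERROR with GENERATED ALWAYS

-- Overriding must be spelled out explicitly, which prevents accidental
-- id reuse while still allowing deliberate inserts (e.g. migrations):
INSERT INTO items (id, name) OVERRIDING SYSTEM VALUE VALUES (42, 'migrated');
```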


Regarding "Match Against Multiple Patterns", the examples are about finding the _suffixes_ of something, email domains in the example.

An attempt to find a suffix like that will not be able to use an index, whereas creating a functional index on the reversed value and searching for the reversed suffix as a prefix will:

  # create table users (id int primary key, email text);
  CREATE TABLE
  # create unique index on users(lower(email));
  CREATE INDEX
  # set enable_seqscan to false;
  SET
  
  # insert into users values (1, 'foo@gmail.com'), (2, 'bar@gmail.com'), (3, 'foo@yahoo.com');
  INSERT 0 3
  # explain select * from users where email ~* '@(gmail.com|yahoo.com)$';
                                  QUERY PLAN
  --------------------------------------------------------------------------
   Seq Scan on users  (cost=10000000000.00..10000000025.88 rows=1 width=36)
     Filter: (email ~* '@(gmail.com|yahoo.com)$'::text)

  # create index on users(reverse(lower(email)) collate "C"); -- collate C explicitly to enable prefix lookups
  CREATE INDEX

  # explain select * from users where reverse(lower(email)) ~ '^(moc.liamg|moc.oohay)';
                                               QUERY PLAN
  ----------------------------------------------------------------------------------------------------
   Bitmap Heap Scan on users  (cost=4.21..13.71 rows=1 width=36)
     Filter: (reverse(lower(email)) ~ '^(moc.liamg|moc.oohay)'::text)
     ->  Bitmap Index Scan on users_reverse_idx  (cost=0.00..4.21 rows=6 width=0)
           Index Cond: ((reverse(lower(email)) >= 'm'::text) AND (reverse(lower(email)) < 'n'::text))

(Another approach could of course be to tokenize the email, but this is about pattern matching in particular.)


Alternatively you could use a trigram index, which I've used in the past successfully to speed up queries using LIKE '%gmail.com'.

Blogpost discussing this approach: https://about.gitlab.com/blog/2016/03/18/fast-search-using-p...
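A sketch of the trigram approach against the same users table:

```sql
-- pg_trgm ships with Postgres as a contrib extension:
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX users_email_trgm ON users USING gin (email gin_trgm_ops);

-- LIKE/ILIKE with a leading wildcard can now use the index:
EXPLAIN SELECT * FROM users WHERE email LIKE '%gmail.com';
```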


Great article, always learn a lot from this author.

Here is another way I have used to do pivot tables / crosstab in postgres where you have a variable number of columns in the output:

https://gist.github.com/ryanguill/101a19fb6ae6dfb26a01396c53...

You can try it out here: https://dbfiddle.uk/?rdbms=postgres_9.6&fiddle=5dbbf7eadf0ed...


Yes, just the per-DB history was worth the price of admission. Too bad there is no RSS feed.



Thanks!


Another great hidden feature is the LISTEN/NOTIFY system. You can use it to watch for changes on a table, and then use that to e.g. update a view in the UI.

For example, create a table with an on-insert trigger that does "NOTIFY new_data". Then when querying, do

    LISTEN new_data;
    SELECT ...;
Now you'll get the results and any future updates.
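A fuller sketch of the trigger side (table and function names are illustrative):

```sql
CREATE FUNCTION notify_new_data() RETURNS trigger AS $$
BEGIN
    PERFORM pg_notify('new_data', NEW.id::text);  -- payload is optional
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER on_insert_notify
AFTER INSERT ON measurements
FOR EACH ROW EXECUTE FUNCTION notify_new_data();

-- Any session that has run LISTEN new_data then receives a
-- notification (with the new row's id) after each insert commits.
```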


Expanding on the regex trick: I often use [.] instead of \. to match a period.

They achieve the same thing, but [.] avoids the problem of your host language "helpfully" interpolating \. into . before sending the query to Postgres.
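For example:

```sql
-- Both match a literal dot, but [.] survives host-language string escaping:
SELECT 'foo.bar' ~ 'foo[.]bar';   -- true
SELECT 'fooxbar' ~ 'foo[.]bar';   -- false (an unescaped '.' would match here)
```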


Two things that aren't exactly lesser-known, but that I wish more used continuously as part of development:

- generate_series(): while not the best way to make _realistic_ test data for proper load testing, at least it makes it easy to generate a lot of data. If you don't have a few million rows in your tables during development, you probably don't know how things will behave, because a full table/seq scan will be fast anyway - and you won't spot the missing indexes (e.g. on reverse foreign keys, which I see missing often enough)

- `EXPLAIN` and `EXPLAIN ANALYZE`. Don't save the minutes it takes to look at your query plans during development at the cost of hours spent fixing performance problems in production. EXPLAIN all the things.

A significant percentage of production issues I've seen (and caused) are easily mitigated by those two.

By learning how to read and understand execution plans and how and why they change over time, you'll learn a lot more about databases too.

(CTEs/WITH-expressions are life changing too)
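A sketch combining the two (names illustrative):

```sql
-- Seed a few million rows, then check that the plan actually uses the index:
CREATE TABLE readings (id int PRIMARY KEY, val numeric);

INSERT INTO readings
SELECT i, random() * 100
FROM generate_series(1, 5000000) AS i;

EXPLAIN ANALYZE SELECT * FROM readings WHERE id = 12345;
-- Look for "Index Scan" rather than "Seq Scan" in the output.
```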


It's worth noting that earlier versions of PostgreSQL didn't support the NOT MATERIALIZED option when specifying CTEs. In our setup, this took a huge hit on performance. If we had been on a more recent version of PostgreSQL (12, I believe, is where it landed), or if the query writer had just used a sub-query instead of a CTE, we would have been fine.


Yep! A lot of older posts about CTEs largely advise against them for this reason.

Postgres 12 introduced controllable materialization behaviour: https://paquier.xyz/postgresql-2/postgres-12-with-materializ...

By default, it'll _not_ materialise a CTE unless it is recursive, has side effects, or is referenced more than once.

When not materialized, filters can be pushed down through the CTE.
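A sketch of the explicit syntax (table and column names illustrative):

```sql
-- Postgres 12+: force or forbid materialization explicitly.
WITH recent AS NOT MATERIALIZED (
    SELECT * FROM orders WHERE created_at > now() - interval '1 day'
)
SELECT * FROM recent WHERE customer_id = 42;
-- NOT MATERIALIZED lets the customer_id filter be pushed into the scan
-- of orders; AS MATERIALIZED forces the old pre-12 behaviour of
-- computing the CTE once, in full.
```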


Also, EXPLAIN (ANALYZE, BUFFERS) is by far my favorite: it shows you the number of pages loaded from disk or cache.

Also to note: EXPLAIN just plans the query, while EXPLAIN (ANALYZE) plans and runs it, which can take a while in production.


For the last point about "Find Overlapping Ranges", consider PostgreSQL range types [0] and their corresponding range operators [1].

[0]: https://www.postgresql.org/docs/current/rangetypes.html

[1]: https://www.postgresql.org/docs/current/functions-range.html
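A sketch of both the overlap operator and an exclusion constraint built on it (names illustrative; btree_gist is needed to mix = with && in the constraint):

```sql
-- && tests overlap; the bounds flag controls inclusivity ('[)' = half-open):
SELECT tstzrange('2021-08-01 15:00+00', '2021-08-01 16:00+00', '[)')
    && tstzrange('2021-08-01 16:00+00', '2021-08-01 17:00+00', '[)');
-- false: back-to-back meetings don't overlap with half-open ranges

-- An exclusion constraint lets the DB itself forbid overlapping bookings:
CREATE EXTENSION IF NOT EXISTS btree_gist;
CREATE TABLE meetings (
    room int,
    during tstzrange,
    EXCLUDE USING gist (room WITH =, during WITH &&)
);
```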


What I do is write a check that determines whether they don't overlap, then negate it. Done.


On this first point — 'Get the Number of Updated and Inserted Rows in an Upsert' — unfortunately this `xmax` method is not always reliable and there does not seem to be any reliable alternative.

I posted to pgsql-hackers about this problem here: https://www.postgresql.org/message-id/DE57F14C-DB96-4F17-925...


I've seen similarly titled articles and they are usually full of the features that aren't used for a reason (MONEY column type, etc).

It's nice to see one that's full of good advice.


I'm surprised there is no mention of foreign data wrappers, easily one of the best but lesser-known features.

Would I use them in production? No. Are they fun to play around with? Yes!


Atlassian gives us sales figures as a large CSV. Nothing better than exposing the CSV as a Postgres foreign table with a foreign data wrapper, and performing pivots/joins/anything on it. Replace the file with a more recent one, and you even have versioning!
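A sketch using file_fdw, which ships with Postgres (path and columns illustrative):

```sql
CREATE EXTENSION IF NOT EXISTS file_fdw;
CREATE SERVER csv_files FOREIGN DATA WRAPPER file_fdw;

CREATE FOREIGN TABLE sales (
    sku text,
    qty int,
    amount numeric
) SERVER csv_files
  OPTIONS (filename '/data/sales.csv', format 'csv', header 'true');

-- Query, join, and pivot it like any other table; the file is re-read
-- on each scan, so replacing it updates the "table" in place:
SELECT sku, sum(amount) FROM sales GROUP BY sku;
```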


We use them in production, they work fine. You probably want to enable use_remote_estimate for postgres_fdw though.


Could you describe for what use case are you using foreign data wrapper in production?


I used a foreign data wrapper to query elasticsearch indexes from within postgres.[0]

It pushed a lot of complexity down, away from higher-level app developers not familiar with ES patterns.

[0]: https://github.com/matthewfranglen/postgres-elasticsearch-fd...


One place they are great in production is as part of a data migration. If you have an old database system with application code on top that you want to move to pure Postgres, FDWs provide a way to incrementally rewrite the application logic without ever having to do a hard cutover.


I may be wrong, but isn't "overlapping" just "meeting_a.starts_at<meeting_b.ends_at && meeting_a.ends_at>meeting_b.starts_at"?

Or in words: There is an overlap if meeting A starts before meeting B ends and meeting A ends after meeting B starts.

So the scenario described looks way more complex than it actually is.


I believe it's about the common case of two events sharing a single (infinitely small) point of time: meeting A from 3 to 4, followed by meeting B from 4 to 5.

Those two time periods do overlap mathematically, but not physically: they share the common point of "4", even if that point has duration 0.

I know the iCal standard goes into this for a bit.


Don’t you just treat the range as half-open? Eg a '[)' range in Postgres.


If you use left-inclusive, right-exclusive intervals [), then the overlap condition is:

meeting_a.starts_at < meeting_b.ends_at && meeting_b.starts_at < meeting_a.ends_at

So, indeed, the scenario as described in the article is more complex than needed.
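For illustration, the half-open test can be sketched in Python (a toy sketch using integer hours rather than real timestamps):

```python
def overlaps(a_start, a_end, b_start, b_end):
    # Half-open intervals [start, end): strict '<' means meetings that
    # merely touch at a boundary do not count as overlapping.
    return a_start < b_end and b_start < a_end

assert not overlaps(3, 4, 4, 5)  # A from 3-4 followed by B from 4-5
assert overlaps(3, 4, 3, 4)      # identical times do overlap
assert overlaps(2, 5, 4, 6)
```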


I get exactly the same result with that query as with the "overlap" query:

SELECT * FROM meetings, new_meetings WHERE new_meetings.starts_at<meetings.ends_at AND new_meetings.ends_at>meetings.starts_at;


Well, for this feature, it means checking whether meeting A and meeting B share any point in time (except their boundaries).


I have tried this with the condition I described and don't see any case where it fails.


You'd fail on the case where the times are exactly the same.


Both meeting A and B start at 2pm and end at 3pm.

After you're done with that corner case, you may find it better to determine if they don't overlap and negate that.
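The negate-the-disjoint-test trick is often the easiest to get right. A toy Python sketch (assuming half-open [start, end) meetings):

```python
def overlaps(a, b):
    # a and b are (start, end) pairs; test the simpler "disjoint"
    # condition and negate it: A ends by the time B starts, or vice versa.
    a_before_b = a[1] <= b[0]
    b_before_a = b[1] <= a[0]
    return not (a_before_b or b_before_a)

assert overlaps((14, 15), (14, 15))      # both 2pm-3pm: they do overlap
assert not overlaps((13, 14), (14, 15))  # back-to-back: no overlap
```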


The upsert strategy mentioned (using ON CONFLICT to update instead of insert) has a catch: if your PK is a sequence, it will increment even though the insert has failed.

It's not a huge deal in most cases, but it's a bigger issue if you are running any sort of scheduled ETL to update (or insert new) things.


Does it matter? In case of rollbacks, the sequence would increment too. You have to handle gaps anyway.


Careful with DISTINCT ON, it can be extremely slow, like 100x slower than a verbose and less readable alternative. But it might not matter in your use case.


I don't believe this is correct for Postgres. It's true for other RDBMSs, I believe. <https://medium.com/@smitagudale712/distinct-vs-group-by-in-s...>


I encountered this problem in Postgres 13.

https://blog.timescale.com/blog/how-we-made-distinct-queries...


A hand-written loose index scan sometimes helps in a case like this. I think it's not built-in yet.
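For illustration only, the idea behind a loose index scan is to jump from one distinct key to the next instead of reading every index entry. A toy Python sketch of that access pattern:

```python
import bisect

# A sorted "index" on one column, with many duplicates per key.
index = [1, 1, 1, 2, 2, 3, 3, 3, 3, 4]

distinct = []
i = 0
while i < len(index):
    distinct.append(index[i])
    # Jump past every duplicate of the current key (binary search),
    # rather than scanning each entry one by one.
    i = bisect.bisect_right(index, index[i], i)

assert distinct == [1, 2, 3, 4]
```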


This example didn't bring anything new to the table, only adds redundant extra chars:

    SELECT *
    FROM users
    WHERE email ~ ANY('{@gmail\.com$|@yahoo\.com$}')
Perhaps they were intending something similar to the following example instead. This one works but has several potential lurking issues:

    with connection.cursor() as cursor:
        cursor.execute('''
            SELECT *
            FROM users
            WHERE email ~ ANY(ARRAY%(patterns)s)
        ''' % {
            'patterns': [
                '@gmail\.com$',
                '@yahoo\.com$',
            ],
        })
The dictionary-style interpolation is unnecessary, the pattern strings should be raw strings (the escape is ignored only due to being a period), and this could be a SQL injection site if any of this is ever changed. I don't recommend this form as given, but it could be improved.
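To make the raw-string point concrete, here's a small check with Python's standard `re` module (the safer fix on the SQL side is to bind the pattern array as a query parameter rather than interpolating it into the statement):

```python
import re

# Raw strings so the backslash reaches the regex engine intact.
# ('\.' happens to survive in a plain string too, because '\.' is not
# a recognized Python escape -- the original only works by luck.)
patterns = [r'@gmail\.com$', r'@yahoo\.com$']

def matches_any(email):
    return any(re.search(p, email) for p in patterns)

assert matches_any('someone@gmail.com')
assert matches_any('someone@yahoo.com')
assert not matches_any('someone@gmailXcom')  # '.' must stay literal
```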


> This example didn't bring anything new to the table, only adds redundant extra chars:

OP indicated as much saying:

> This approach is easier to work with from a host language such as Python

I'm with you on the injection - have to be sure your host language driver properly escapes things.


God damn, I need to unlearn those because they will spoil my ability to develop non-Postgres systems.


I’m crying in sql server land


One of the hardest types of queries in a lot of DBs is the simple `min-by` or `max-by` queries - e.g. "find the most recent post for each user." Seems like Postgres has a solution - `DISTINCT ON` - though personally I've always been a fan of how BigQuery does it: `ARRAY_AGG`. e.g.

  SELECT user_id, ARRAY_AGG(STRUCT(post_id, text, timestamp) ORDER BY timestamp DESC LIMIT 1)[SAFE_OFFSET(0)].*
  FROM posts
  GROUP BY user_id
`DISTINCT ON` feels like a hack and doesn't cleanly fit into the SQL execution framework (e.g. you can't run a window function on the result without using subqueries). This feels cleaner, but I'm not actually aware of any other DBMS that supports `ARRAY_AGG` with `LIMIT`.
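Both forms compute the same "latest row per group" result; a tiny Python sketch of that semantics, on made-up data:

```python
# (user_id, post_id, timestamp) rows, in no particular order.
posts = [
    (1, 'p1', 10),
    (2, 'p3', 5),
    (1, 'p2', 20),
]

# Keep only the row with the greatest timestamp for each user --
# what DISTINCT ON (user_id) ... ORDER BY timestamp DESC computes.
latest = {}
for user_id, post_id, ts in posts:
    if user_id not in latest or ts > latest[user_id][1]:
        latest[user_id] = (post_id, ts)

assert latest == {1: ('p2', 20), 2: ('p3', 5)}
```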


I would use a lateral join. Alternatively you can use a window function instead of a lateral join, but from my experience lateral joins are usually faster for this kind of top-N query in postgres.

  select user_table.user_id, tmp.post_id, tmp.text, tmp.timestamp
  from user_table
  left outer join lateral (
    select post_id, text, timestamp
    from post
    where post.user_id = user_table.user_id
    order by timestamp desc
    limit 1
  ) tmp on true
  order by user_id
  limit 30;


And we have the groupArray aggregate function with a limit.

> I'm not actually aware of any other DBMS that supports `ARRAY_AGG` with `LIMIT`.

So, put ClickHouse in your list :)


I've generally used `PARTITION BY` to achieve the same effect in big query (similar to what was suggested in the article for postgres).

Does the `ARRAY_AGG` approach offer any advantages?


I assume you mean `ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...) as row_number` with a `WHERE row_number = 1`. We ran into issues with memory usage (i.e. queries erroring because of too high memory usage).

The `ARRAY_AGG` approach uses a simple heap, so it's as efficient as MIN/MAX because it can be computed in parallel. ROW_NUMBER however needs all the rows on one machine to number them properly. ARRAY_AGG combination is associative whereas ROW_NUMBER combination isn't.
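The associativity claim is easy to sanity-check: the "keep the newest" combine step gives the same answer however partial results are grouped, which is what lets workers merge independently. A tiny sketch:

```python
def combine(a, b):
    # Keep whichever (timestamp, payload) row is newer.
    return a if a[0] >= b[0] else b

rows = [(3, 'c'), (1, 'a'), (2, 'b')]

# Combining left-to-right or right-to-left yields the same result.
left_first = combine(combine(rows[0], rows[1]), rows[2])
right_first = combine(rows[0], combine(rows[1], rows[2]))
assert left_first == right_first == (3, 'c')
```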


That makes sense, I might have a play with the ARRAY_AGG approach. I'm especially curious whether it has any impact on slot time consumed.


In ClickHouse we have argMin/argMax aggregate functions. And also LIMIT BY. It looks like this:

    ... ORDER BY date DESC LIMIT 1 BY user_id


> ERROR: permission denied for table users

I would hesitate to call this a useful feature. This error message doesn’t tell you about the actual problem at all.

How about something like:

Permission denied, cannot select other than (id, name) from table users


That COPY trick is really neat. I've always used SELECT INTO.

    SELECT *
    INTO TEMP copy
    FROM foobar;
    \copy "copy" to 'foobar.csv' with csv headers


I'm always looking for new copy tricks. Is "copy" a temporary table? Or a variable...?


Temporary table. Although looking at the docs it seems CREATE TABLE AS is the recommended syntax. https://www.postgresql.org/docs/13/sql-createtableas.html

    CREATE TEMP TABLE "copy" AS
    SELECT * FROM foobar;


The upsert strategy mentioned (using ON CONFLICT to update instead of insert) is something we use internally quite a bit, especially in multi-insert transactions, which were horrendous to handle without conflict resolution because any conflict would end up cancelling the entire transaction.

Though I wish there was an easy way to figure out how many times a conflict actually occurred (i.e. how many times did we insert vs update).


Haki Benita always has the best database blog posts <3


OOooo! I love it! I've been a PG user for... a very long time. The fact it supports inet datatypes just makes me swoon TBH.


Surprised to not see any mention of Postgres native pub/sub messaging: https://vijayg.com/postgresql-pubsub/

Is that feature considered well-known? Or is it so obscure this author didn't know about it?


I read that blog post but I don't quite understand it.

I assumed it would work as a regular pub/sub pattern where you get notified when some event happens. However, in the attached example, they still poll the database every half second. I'm not sure I understand the idea.


It's maybe not a great post, it was just one that came up when I searched.

> JDBC driver cannot receive asynchronous notifications.

Maybe their example is limited by the choice of client library.

This might be a better example? https://tapoueh.org/blog/2018/07/postgresql-listen-notify/

Or the offical postgres docs are:

https://www.postgresql.org/docs/14/sql-notify.html

https://www.postgresql.org/docs/14/sql-listen.html

https://www.postgresql.org/docs/14/libpq-notify.html


Thanks, it now makes sense.

Quite strange that this specific driver didn't support asynchronous notifications: if you have to poll the database anyway, I guess there's not much benefit over doing it without listen/notify at all.


Avoid using lesser known features...

They're the ones that will be hardest to migrate to a different database, most likely to be deprecated, and least likely to be understood by the next engineer to fill your shoes.

While many of these are neat, good engineering practice is to make the simplest thing to get the job done.


Counterpoint:

Databases often outlive applications. Database features usually do what they do very quickly and reliably. Used well, a full-featured database can make things like in-place refactors or re-writes of your application layer far easier, make everyday operation much safer, and make it much safer & more useful to allow multiple applications access to the same database, which can be very handy for all kinds of reasons.

> While many of these are neat, good engineering practice is to make the simplest thing to get the job done.

That, or, use the right tool for the job.


I'd rather have a non-portable SQL feature than an app-level "portable" implementation that almost certainly is slower and non-transactional.

Migrating to another DB will always take work, no matter how much you try to ignore flavor-specific features. Query planning, encoding and charsets, locking abilities tend to be very different. A query can run fine in MySQL and cause a deadlock in Postgres even though it's syntactically valid in both.


Even the most obscure Postgres features are documented better than whatever code a dev would write in order to avoid using them.


How often does migrating databases ever actually happen? I'm not saying it doesn't happen, but I've never experienced it or known anyone who has (I've asked!). I certainly wouldn't shy away from using features that make my app code cleaner and improve the all-around performance of the app on the slim-to-none chance that I might one day have to migrate off of Postgres.


20-year career so far. Never seen a database swapped out. I've seen "apps" replaced on top of databases, or more programs added to access the same database. I've seen the app and database both get thrown out and replaced, because they were tightly coupled[0], as the parent advocates (Rails + ActiveRecord seems to be prime for this kind of "gotta throw it all out" situation). I've never seen the program stay the same while the DB is swapped out from under it.

[0] yes, that's actually tightly coupling them, because now your DB is too unsafe to use without the "app" you built on top, and doesn't provide enough functionality to make it worth trying to retrofit that safety onto it.


I've seen it firsthand, but we're talking about an early-stage startup moving from MySQL to Postgres. It's definitely not enough to convince me not to use database-specific features.


I’ve added an ActiveRecord interface to an existing schema (on SQL Server, no less). It takes a lot of overriding of its assumptions, and sometimes preventing writes via ActiveRecord, but it is doable.


20 years ago I ported an app from Microsoft SQL server to Oracle as a new requirement from the incoming CTO equivalent. I was told the whole place was standardizing on Oracle as they had in house expertise. Turns out they didn’t and while the port was completed and successful they had no end of trouble managing it.

I’ve since ported a few apps in early development or production between Microsoft SQL server, MySQL and Postgres (in various directions) but nothing in prod in over 10 years


I'm not sure why you are being downvoted. Your argument is reasonable and clearly presented, and "do the simplest thing that can work" is a tried and true design philosophy.


Using a built-in database feature is often a lot simpler.


Fair... and that's a good counter-argument, but not a good reason to downvote someone who disagrees. (Not suggesting that you downvoted him yourself.)


I just learned about OVERLAPS this week when reviewing a colleague’s code. We talked about the complexity of the conditions and found the overlaps operator and a ton of range operators. None of us knew they existed and they made our query a lot better.


Great article. Read through some of the other articles as well - some high quality content there!


The author's posts on Django are always worth reading. They're way more detailed than the run-of-the-mill rehash of the Django documentation that most blogs contain and there's always a nugget of "I never knew that" in them.


I love those kinds of posts. Always something that you keep for life.


I enjoy using Postgres's Network Data Types. I'm sure it wasn't that hard to implement them, but the ease of use is undeniable.


I found this blog a gold-mine of information. Is there any book recommendation that goes in the same direction?


The comment one is pretty neat, I'm going to start using it:

  db=# COMMENT ON TABLE sale IS 'Sales made in the system';
  COMMENT


In-database comments combined with something like https://github.com/k1LoW/tbls make for very cheap database documentation.

No affiliation with tbls except that I'm a big fan


My favorite relatively obscure pg feature is you can write stored procedures in perl, python, and tcl.


Not to mention Java, R, Ruby, PHP, Scheme, and sh.

https://www.postgresql.org/docs/9.5/external-pl.html

Heck, there's even a pl/prolog out there, but it looks pretty old and I'm skeptical how useful it would be.

https://github.com/salva/plswipl


They're cool, but keep in mind that this often means you cannot run them on managed Postgres services.


Wow, I've been writing a lot of PL/pgSQL recently and did not know this.

Would these be drop in replacements for PL/pgSQL? Are there any performance tradeoffs to using one over another? Any changes in functionality (aka functions you can call)?


I personally have never run into whatever the limitations might be of pl/python, but I suppose you could make a disaster by trying to dynamically load conflicting symbols or something like that. But I have used numpy in stored procedures, so at least that works.

pl/sh also works wonderfully if you want to run a complex procedure in an isolated subprocess, but pl/sh is not part of the main postgresql distribution.


Great article, a lot of useful commands for those who do their programming in the DB.


Beautifully presented, good advice.

Enjoyed and bookmarked.


> Grant Permissions on Specific Columns

This feature might be great for OLAP systems, but it's awful for OLTP systems. The last thing I want is to debug permission errors bubbling up from my database into the application layer. It may not be as efficient, but you should always manage your ACL rules in the application layer if you're building an OLTP app.


The word "always" should always be avoided in architecture discussions.


I qualified it by saying "if you're building an OLTP app" though.

I speak from personal experience. If you know a good usecase for ACL in database for OLTP workflows, I'm all ears.


> If you know a good usecase for ACL in database for OLTP workflows, I'm all ears.

Here's one that's meaningful to me. We have a single database with five different applications that access it, each of them managed by a separate team. By enforcing access constraints in the database we guarantee the access constraints will be applied in all cases. It is difficult to ensure that in application code managed by separate teams.

(Just to be clear: I don't think your advice is poor, I just wanted to give an example of a case where it isn't universally applicable.)


I see, thanks for your input. Most of the workflows I worked on involved provisioning one database per app, so I hadn't entertained this angle.

Are there any risks to changing ACL rules on a production database server from a stability perspective?


> Are there any risks to changing ACL rules on a production database server from a stability perspective?

I've never seen anything impact Postgres stability. They test their code.

I've made DDL changes on live systems; that seems far more concerning than ACLs.


Even then, it would be better phrased as a tradeoff. What's the downside of putting in the database? I'd guess it makes horizontally scaling harder, and it's less portable to other database vendors. Is there an upside, like you've got stronger data security guarantees?


Agree with all points here.

An upside to using db to handle data permissions is the data has the same protections if you are accessing with direct SQL or using an app. Also, those permissions would persist with the data in restored backups.

I’m not advocating this and think for OLTP, building RBAC in the app layer is almost always a better idea, but these would be some benefits.


Good point. Phrasing it as a trade-off is the better approach. Agree with the downsides you listed, as well.

Better security guarantees, for sure, but if an unauthorized user gains direct access to your database, you might have bigger problems from a security perspective.



