One of the lesser-known behaviours (I'm reluctant to say "features") is that you can have a sort of virtual field on a table that executes a function when the field is accessed. This works because PostgreSQL treats `row.property` the same as `property(row)`.
For example, we can have the function `full_name(person)` that returns the concatenation of `person.first_name` and `person.last_name`, and we can do `SELECT p.full_name FROM person p`. I think it's pretty neat.
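A minimal sketch of what that looks like (the table and function are just for illustration):

```sql
CREATE TABLE person (
  id         serial PRIMARY KEY,
  first_name text NOT NULL,
  last_name  text NOT NULL
);

-- A function whose argument is the table's row type behaves like a virtual column.
CREATE FUNCTION full_name(p person) RETURNS text AS $$
  SELECT p.first_name || ' ' || p.last_name;
$$ LANGUAGE sql STABLE;

-- These two are equivalent:
SELECT full_name(p) FROM person p;
SELECT p.full_name  FROM person p;
```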
I had no idea about this (and I've been using the `property(row)` pattern for years), but sure enough it is documented at the bottom of the section https://www.postgresql.org/docs/10/static/rowtypes.html#ROWT.... Given that it doesn't really let you do anything new, I've got to wonder if it's worth the potential confusion for people who haven't found this obscure part of the documentation and will lose hours upon hours trying to figure out where the table gets this "foobar" column defined...
Well, usually the first step in investigating the database is to look at the table definition (either in the migration or in the dump your database can produce); that usually clears up such confusion.
Cool feature indeed, though keep in mind that for it to work, the function behind the "virtual field" needs to be on the `search_path` of the session user.
In PostgREST, we take advantage of this behavior to generate queries without needing extra code to differentiate between a field and a "virtual field". We call these "computed columns", though: https://postgrest.org/en/v5.0/api.html#computed-columns.
This is one of those things that seems like it would have limited value at first, but once you start to play with it you realize how powerful it actually is.
The first database software that I was paid to work on was UniVerse. At the time it was owned by, I believe, VMark Software, but it changed hands a few times and is now part of the Rocket U2[0] family. It has an equivalent feature, which was very heavily used, called I-Descriptors.
An I-Descriptor is an entry in a file's data dictionary that contains code to execute to calculate a value. You could generate a full name out of the firstname/lastname fields, perform conditional logic (e.g., return a specific address based on a preferred-address field), etc. You could also call subroutines, where you had the full power of the BASIC language available.
One of the interesting things I did was create a field which would perform geolocation based on the postal code portion of the address. It would split out the postal code, check a cache file for the data, and, if it wasn't already cached, call a web service to retrieve the data and cache it; otherwise it would just return the cached value. Everything needed was built into the database; it was just a matter of coding it. The best part was that it became just another thing you could do in the query language - answering the question "list all active clients in this geographic area" became just another query.
One feature that's incredibly useful, but not on this list, is the JSON support[0]. Being able to store schemaless documents alongside your structured data is very handy in some instances. Being able to query the JSON documents in a useful manner is also incredible.
Not just being able to query them, but being able to create useful indexes on the data as well. This means you don't have to break out parts of the JSON into separate columns and maintain both a schema and a schemaless setup for them. The indexes aren't going to be as optimized as an index on a single column, but it makes the whole thing a lot more robust in the end than relying on your application or triggers to always keep mirrored columns up to date properly.
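Roughly what that looks like (the table and keys here are made up):

```sql
CREATE TABLE events (
  id      bigserial PRIMARY KEY,
  payload jsonb NOT NULL
);

-- GIN index over the whole document; supports the @> containment operator.
CREATE INDEX events_payload_idx ON events USING gin (payload);

-- Index-assisted containment query against the schemaless data.
SELECT id FROM events
WHERE payload @> '{"type": "signup", "source": "mobile"}';

-- An expression index on a single extracted key also works.
CREATE INDEX events_user_idx ON events ((payload ->> 'user_id'));
```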
The most useful part of JSON support is making your query return data from multiple joins and sub-queries in one field. Then you can just json decode the result instead of fetching tons of rows. You could already do that with xmlagg but nowadays people like their JSON.
> SELECT a.id, JSONB_AGG(b.id) AS b_id FROM a, b GROUP BY a.id
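A slightly fuller version of that idea (assuming hypothetical `b.a_id` and `b.name` columns) returns one row per `a` with all matching `b` rows embedded as a JSON array:

```sql
SELECT a.id,
       jsonb_agg(jsonb_build_object('id', b.id, 'name', b.name)) AS b_rows
FROM a
JOIN b ON b.a_id = a.id
GROUP BY a.id;
```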
Posts like this always make me a little sad, because at work we use Microsoft SQL Server[0], and for private programming I usually prefer SQLite. Still, in some ways, Postgres feels like somebody invented a time machine and brought back a database engine from the future. *Sigh* Maybe I will find a good reason to use Postgres one of these days... ;-)
[0] Do not get me wrong - if you can ignore the price tag, MSSQL is a great database engine. It has never given me any trouble I have not asked for, and it has some very nice features of its own (e.g. the Service Broker).
On a relative scale with other DBMS's, Postgres is one of the most featureful, nicest-to-use databases. In reality, though, it's really terrible... It's just the least terrible.
One problem is that all those fancy features are released before they are done and never completed afterwards. Thus, Postgres is a mess of special cases that prevent things from working well together. Here are just a few examples I can list off the top of my head:
* You can't make arrays of domain types (domain types are a way to constrain the allowed values of some underlying more-primitive type)
* You can't parse JSON into a composite that has composites as any of its field types
* You can't do update...returning or insert...returning as a subquery and have to fall back to CTEs (see the sketch after this list)
* Composites are not allowed to contain instances of themselves within any of their fields
* Table inheritance doesn't work well with uniqueness constraints
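To illustrate the RETURNING point: the plain subquery form is rejected, and the usual workaround is a data-modifying CTE, roughly like this (table names made up):

```sql
-- Not allowed: INSERT ... RETURNING used directly as a subquery.
-- SELECT id FROM (INSERT INTO items (name) VALUES ('x') RETURNING id) AS t;

-- Works: a data-modifying CTE instead.
WITH inserted AS (
  INSERT INTO items (name) VALUES ('x')
  RETURNING id
)
INSERT INTO item_audit (item_id, action)
SELECT id, 'created' FROM inserted;
```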
Additionally, both the text and binary data exchange formats are completely unspecified, with the binary format being completely undocumented and the text format being woefully underdocumented. The text format is also just really, really messy, and the results that you get back from the server are very irregular... It's pretty much impossible to write a Postgres driver that is anything other than a pile of hacks produced by reading the code and guessing. Pretty much none of them properly handle the more complicated cases (most just punt on composites, for example).
Anyway: Postgres has a lot of wonderful tech in it that would take many, many years to duplicate and I would pick it from the field of actually-available RDBMSes every time but... it's still deeply terrible.
All that is to say: Postgres very much feels (to me) like tech from the awful, edge-casey past that is simply so useful and so hard to replicate that we are stuck with it!
I think RDBMSs will always feel slightly "special-case"-y when compared, as you seem to be doing, with general-purpose programming languages - remember that SQL was never designed to be a Turing-complete language (insert someone showing CTE tricks that make it so...). It instead exposes a number of primitives that a database is generally good at reasoning about, and in return it can "see through" a lot of the queries the user writes and find efficient ways of performing them.
If you want a general purpose language as your database engine, you'll likely end up with something more like MongoDB where for any moderately complex query you'll end up hand-writing a single static query plan for everything.
This is probably why people seem less fussed about some of the asymmetries you identify in RDBMSs. Different philosophies.
> Composites are not allowed to contain instances of themselves within any of their fields
Do you mean recursive type definitions? Or? Who does allow that? And what would you use it for in an SQL context? Please elaborate, I'm curious what you are getting at here.
> Table inheritance doesn't work well with uniqueness constraints
This got somewhat better in PostgreSQL 11 (beta1 is available now) in that unique constraints work for partitioned tables. I don't know if that also covers inherited tables; they are similar but not the same. Partitioned tables are the case most people are interested in anyway.
Will always have a soft spot in my heart for MSSQL, it's always been a relative pleasure to work with (even though I had to use Windows). I worked at the company that spawned MySpace, and after we were acquired by FOX I led a project to build a white-label social networking platform that could be re-used across FOX properties. We were getting 12-14k QPS back in 2008 on a single 32-bit quad-core AMD Opteron (1st gen) with a DAS (in a time when SANs ruled supreme). That was the first season of American Idol that the website didn't crash during broadcast.
SQL Server 2005 introduced SQLOS, bypassing the OS for direct memory and disk access, which did wonders for performance. Parallel query execution in 2010 further cemented the performance gains.
SQL 2017 is still the fastest mainstream RDBMS and now runs on Linux too, using that same PAL architecture to abstract the differences. Features like columnstores, in-memory tables and graph processing make it one of the most versatile systems today.
If only it had simple upserts, array types and proper JSON support: small details that make a big difference.
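For comparison, the Postgres upsert being alluded to looks roughly like this (the table is made up):

```sql
-- Requires a unique constraint or index on "key".
INSERT INTO kv_store (key, value)
VALUES ('greeting', 'hello')
ON CONFLICT (key)
DO UPDATE SET value = EXCLUDED.value;
```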
This has been my experience as well, though I haven't used it since SQL 2012. I dreadfully miss SQL Server Management Studio (SSMS), it is BY FAR the best IDE/GUI for a database I've ever come across.
We have found service broker to be one of the most brittle things. But I agree that the engine is really great. At the startup we use Postgres and as a former sql server DBA I love it. Auto sharding in 11 will be awesome.
" Auto sharding in 11 will be awesome." This with some form of multi-master without a single cordinator or something similar would be amazing. Too much to ask though
Table inheritance has some serious limitations regarding foreign keys and some other aspects. The following warning is in the official documentation in that chapter:
> These deficiencies will probably be fixed in some future release, but in the meantime considerable care is needed in deciding whether inheritance is useful for your application.
Here is the section on the limitations. Having tried this out myself, please be aware that this is not really 'inheritance'. I had to ditch it completely from my design:
For what I'm doing I got away with having insert/delete triggers on child tables that mirror their identity/type to a 'types' table, also enforced through check constraints and FKs.
Partitioned tables and child tables are different things; I'd be surprised if 2nd Q is actually working on support for child tables. (Child tables were, I believe, historically used to implement partitioning, but Postgres now has native partitioning support.) There's no mention of FK support on child tables in the v11 release notes that I can find: https://www.postgresql.org/docs/11/static/release-11.html
Seeing this has made me realize that I have one HUGE bug with native Postgres full text search.
On my blog, I have a lot of articles about the game series Borderlands. If you type "Borderlands" into the search box, it will find them, but if you type "border lands", it won't. Same with "starcraft" and "star craft", etc.
It looks like I will have to implement trigrams on top of full text search to fix:
Take a look at [0]. I think you can solve your problem with a custom text search dictionary or possibly a thesaurus dictionary. They can substitute or canonicalize documents or search terms and look pretty easy to set up.
The downside with dictionaries is that they need to be continually updated depending on the underlying content. Trigrams seem to be more versatile and hands-off, and they can work with misspelled search terms.
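A minimal trigram setup for the blog case above might look like this (table and column names are just an example):

```sql
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- Trigram GIN index on the article title.
CREATE INDEX articles_title_trgm_idx ON articles USING gin (title gin_trgm_ops);

-- Similarity search: "border lands" should match "Borderlands".
SELECT title, similarity(title, 'border lands') AS score
FROM articles
WHERE title % 'border lands'   -- % is the trigram similarity operator
ORDER BY score DESC;
```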
Table inheritance is a non-starter for the stated use-case. Just don't. If memory serves me well even the Postgres docs recommended against using it for that last I read them. The only sane use-case of inheritance I'm aware of is when you're partitioning a table.
> Table inheritance is a non-starter for the stated use-case. Just don't.
Can you be more specific about why you shouldn't use it like this? I've used it for similar cases and it's seemed to work fine. You have to declare some foreign keys / constraints that you shouldn't really need to, but it's not really any harder than creating the "child" tables from scratch anyway.
I haven't programmed in years, but insofar as I can recollect, one issue was foreign keys. You basically cannot have a dependable foreign key on tables with child tables - not a useful one anyway, because the parent and child tables are treated separately. The way foreign keys are (were?) built into Postgres is using built-in system-level triggers for on delete/update events. Target the wrong table and all hell breaks loose.
I'm actually using table inheritance at the moment for a project and it works quite well.
I use table inheritance to manage common fields in the database more easily, instead of having to define a separate table and reference it.
For example, a base model that includes fields for external objects the row might reference and certain option fields, or a ratings table that gives an object all the fields necessary for rating a product. This way I can also reuse a lot of code, since the structs on the other side can share a lot of code and data as well.
In Postgres 10, only range and list partition keys are supported, but PG11 will add hash partitioning, as well as a number of other partitioning improvements.
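Declarative range partitioning in PG 10 looks roughly like this (table names invented):

```sql
CREATE TABLE measurements (
  logdate date NOT NULL,
  value   numeric
) PARTITION BY RANGE (logdate);

CREATE TABLE measurements_2018 PARTITION OF measurements
  FOR VALUES FROM ('2018-01-01') TO ('2019-01-01');

CREATE TABLE measurements_2019 PARTITION OF measurements
  FOR VALUES FROM ('2019-01-01') TO ('2020-01-01');

-- PG 11 adds hash partitioning:
--   PARTITION BY HASH (some_id)
--   ... FOR VALUES WITH (MODULUS 4, REMAINDER 0);
```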
It is always a balance... but code maintainability isn't something to be dismissed. If utilizing CTEs greatly improves your readability, it's probably worth doing so until there is a performance need to switch off (and it's pretty trivial to de-CTE a query).
Not always. I use them to force hash joins. This way you can filter table A on an index, and filter table B on an index, then join them with a hash join. In a straight join postgres would try to use the same index for all of the joins. At least that was my experience. Maybe I'm not understanding the docs properly or using the correct terminology. I know that the end result was 10x faster than any index I could come up with.
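Roughly the pattern being described; in current releases a CTE acts as an optimization fence, so each side can be filtered via its own index before the join (table and column names invented):

```sql
-- Each CTE is planned on its own, so a and b can each
-- be filtered via their own index before the join.
WITH a_filtered AS (
  SELECT * FROM a WHERE created_at >= now() - interval '7 days'
),
b_filtered AS (
  SELECT * FROM b WHERE status = 'active'
)
SELECT *
FROM a_filtered
JOIN b_filtered USING (customer_id);
```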
I think the article intended to highlight features specific to PostgreSQL. CTEs are undeniably great, especially recursive CTEs, but they're also part of the SQL:1999 standard and available in lots of databases.
Some others more or less obscure features I used and liked:
- data checksuming (can be enabled in initdb)
- extending the database in C (mostly adding various functions that would be hard and slow to implement in SQL, but I've also been able to write functions that generate datasets on the fly, loaded from a custom server over a unix socket)
FDW is awesome and surprisingly useful. I really want Postgres 10's partitioning support to land in an AWS Redshift PG-reboot someday. Partitioning support is very manual and clunky in that old fork.
As far as I know partitioning doesn’t exist at all in Redshift. However, Redshift has sort keys which work even better than partitions for the intended workload. Partitioning would be nice for retention but I’m not sure what additional benefits it brings on top of sorting for a DW workload.
Are there any talks by AWS engineers on why they went with a fork and not, say, a white-box implementation of CitusDB, or just used EnterpriseDB, or made it vanilla Postgres with some more performant defaults?
Redshift is a massively parallel processing columnar storage database. The storage engine is completely different. Redshift is a derivative of ParAccel which was itself forked from Postgres and heavily modified. I assume AWS purchased it because it does things CitusDB and EnterpriseDB don’t do. The intended workloads are completely different.
Good luck. If I remember correctly, Redshift forked off of Postgresql 7.4, so it would be a miracle if they could bring it up to modern postgresql standards.
Anybody who's used the array types care to share their experience? It's something that caught my eye reading the docs, but wasn't sure how it holds up in actual use.
Very convenient, I often use them.
There's really no reason not to if your data structure is appropriate.
The only drawback for me is that you can't have arrays of foreign keys. (You could of course just store an array of the key values, but without the integrity checks).
Isn't an associative table the better option here? It would take a lot of the headaches out of dealing with arrays and constraints (ie, duplicates, limits on relationships) while keeping joins simple.
They are solid and easy to use and particularly good for early prototyping. I've found they work best for passive data: if you have attributes tied to a structure that are required by the application but bear little significance to the database, then they're solid.
If your data is more complex then they tend to break down a bit, but at that point you should probably be looking at a joining table or other construct that supports foreign keys and such.
Nothing special about it: you have an array in your code and you store it as an array in the database without having to split it into records. I've been using it in a Django project for a model field of type array of strings. There is a django.contrib.postgres.fields.ArrayField for that.
A caveat: I used it only for small arrays (about 10 elements maximum.) I don't know how the ORM/driver/database combination would perform with thousands or more elements.
In some cases they make sense, tags being a good example. But the moment you find yourself writing extra code to do multiple updates in order to keep things in sync, that's a sign a normalised option might be a better approach.
I don't feel like tags are a particularly good example. For example, how do you efficiently get a list of all tags and how many X there are per tag? Easy for a relation (SELECT tags.name, COUNT(posts) FROM tags NATURAL JOIN posts GROUP BY tags.name), not so easy for an array (SELECT x.tag, COUNT(posts.*) FROM (SELECT DISTINCT unnest(tags) AS tag FROM posts) AS x JOIN posts ON posts.tags @> ARRAY[x.tag] GROUP BY x.tag).
The issue (iirc) with that sort of query is that it's essentially a join, so you'll get count(projects)*sum(length(regions)) intermediate rows that need to be grouped back together.
What do you want to know... SQL support is excellent including creating arrays during GROUP BY, unnesting them in SELECT, converting to strings and passing to/from user defined functions in JS/python/etc.
Arrays scale as you'd expect, i.e. even if it's allowed, I don't recommend 1MM values in an array because some/many of the aforementioned features aren't tested to scale. AFAIK array operations are single threaded and not vectorized or parallelized.
I tend to avoid them for storage, but I've found them to be really useful in complex queries. I can aggregate a bunch of values into an array on a single record and perform all sorts of useful calculations on them.
I find it best to think of them as sets. A lot of operators treat them that way (e.g. containment), and you can use GIN indexing to accelerate access to them using those operators.
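For example, something along these lines (the `posts.tags` column is hypothetical):

```sql
-- GIN index over the array column.
CREATE INDEX posts_tags_idx ON posts USING gin (tags);

-- Containment: posts carrying both tags, in any order.
SELECT id FROM posts WHERE tags @> ARRAY['postgres', 'arrays'];

-- Overlap: posts carrying at least one of the tags.
SELECT id FROM posts WHERE tags && ARRAY['postgres', 'sqlite'];
```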
Table inheritance seems like a natural one for devs working with a lot of class oriented languages. But also, composition over inheritance seems to be a more frequent watchword. Any experiences using this one?
Also, triggers seem pretty controversial in the discussions I've been privy to; most devs seem to hate them. Any positive experiences with them?
I think many developers want to treat RDBMSs as simple black-box data stores... something, admittedly, you can largely get away with in the majority of applications. I think this leads to a lot of developers avoiding anything that forces a real understanding of the database since they can get a fair amount done without it and that's understandable. Triggers and stored procedures and database functions intrude into this solace and therefore are frowned upon by many.
Having said all that I think it's better to have a firm understanding and a feel for RDBMS operations and development techniques and that you can produce more solid work if you have that understanding. I think if you have that understanding (and actually have an appropriate problem for a RDBMS solution) there are roles for triggers/procedures/functions in the database... just understand where the boundaries are and delineate clear responsibilities to the various pieces of the stack (that's the hard part and the part in which many fail). I also think that if you are a developer needing to make use of an RDBMS and don't have a solid understanding of your RDBMS of choice, you're going to screw up more than just anything to do with triggers/functions/procedures/etc. This is why I always approach projects where ORMs are important with caution. ORMs are just a tool, true, but often times they are also the rug under which the database can be swept... or to put it another way, a framework for building technical debt. (Yeah, yeah, not always, but frequently).
As for PostgreSQL table inheritance... just say no. You cannot overcome the object/relational impedance with that feature and you end up compromising too much of why you might want a RDBMS in the first place. I've only seen it used well twice; one of those cases being table partitioning and that is, from a user perspective, becoming much less necessary with new features in PostgreSQL. PostgreSQL is an Object Relational Database Management System, but I think you have to consider the object part of that as primarily being useful inside the database itself: it can be useful in more sophisticated uses of the database itself, especially if you are writing functions/triggers, but isn't a great tool in my experience to map to OO systems outside of the database. I have found it much clearer from an information architectural perspective and easier to maintain data integrity using standard RDBMS modeling techniques (with the exception of maintaining certain unstructured data as JSON blobs).
I was one of those developers years ago. You find your language's ORM library and away you go, happy coding. It was only after running into some performance issues and other bottlenecks that I put my head down and spent a week learning the DB (granted, it's not a long time, but you can learn A LOT about SQL and your DB in 40 hours).
Not only did it solve my bottleneck issues, it also made it very apparent that there are a lot of things ORMs do that make everything harder. They have their place, but working with the DB/SQL/Triggers/Views/Mat Views/... improved my applications greatly.
Now, years later, anytime I am working with a new developer I always stress that they at least learn the basics.
IMO, triggers should be used sparingly. For something like updating a "last changed" timestamp on a row, they can be useful. But they are places for side-effects to hide, and later be the source of frustration and performance problems. I almost always prefer to keep any business logic in stored procedures not triggers.
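The "last changed" case is about as small as triggers get, something like (names made up; on PG 11+ it's `EXECUTE FUNCTION`):

```sql
CREATE OR REPLACE FUNCTION touch_updated_at() RETURNS trigger AS $$
BEGIN
  NEW.updated_at := now();
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER items_touch_updated_at
  BEFORE UPDATE ON items
  FOR EACH ROW
  EXECUTE PROCEDURE touch_updated_at();
```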
I generally agree with your statement, though I also weigh how a particular piece of logic may affect data integrity.
For example, let's say I have an accounting system with a general ledger structured as a header table (one row per journal entry) and a child detail table (one row per GL account with either a debited or credited amount), and the idea that journal entries need to be "posted" to be considered part of the business's transactional history. I may have a business rule that says: I can never post a journal entry that isn't balanced (debits must equal credits). It is always invalid to have a posted and unbalanced journal entry.
Assuming that the posting status is recorded in the header table, I may well write a trigger on the header table which, on updating the posted status of the header record to posted, checks that all of the detail lines sum up to a balanced journal entry and aborts the transaction if that rule is not met. This is a de facto encapsulation of business logic, but one important to data integrity. Sure I can also use a stored procedure, but I may want to have a trigger to be involved because of the difference between an active and passive application of the rule: a stored procedure for posting requires an explicit call... but just doing an UPDATE against the table won't call my stored procedure (or other application code for posting) whereas a trigger forces the issue. In many of the enterprise environments that I work in, the main accounting application is not the only way data gets into the database so I can't rely on those explicit calls to always be made to enforce the rules... the database is the final authority and forcing it there can also eliminate a class of issues. I may well have my trigger call my posting stored procedure if that procedure were written the right way...
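A rough sketch of that kind of integrity trigger, with the schema simplified and all names hypothetical:

```sql
CREATE FUNCTION check_journal_balanced() RETURNS trigger AS $$
BEGIN
  -- Only check when the entry is transitioning to "posted".
  IF NEW.posted AND NOT OLD.posted THEN
    IF (SELECT COALESCE(SUM(debit), 0) - COALESCE(SUM(credit), 0)
          FROM journal_detail
         WHERE journal_id = NEW.id) <> 0 THEN
      RAISE EXCEPTION 'journal entry % is not balanced', NEW.id;
    END IF;
  END IF;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER journal_header_post_check
  BEFORE UPDATE OF posted ON journal_header
  FOR EACH ROW
  EXECUTE PROCEDURE check_journal_balanced();
```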
Yes, I'm taking other issues like performance for granted as well, but I mostly wanted to show another aspect of the considerations related to trigger use. [edit] Perhaps a more fine-tuned approach is to say that business logic which doesn't change data, but ensures rules are met or the transaction is cancelled, helps with the side-effect issue you cite.
Thanks for your thoughtful comment on this. It’s interesting to hear good use cases for triggers.
Some years back I worked at a place that went deep on triggers and stored procedures and it totally put me off using either (sql server). There was a lot of opaque behaviour that was hard to debug and performance was terrible.
A friend is a great DB developer (one of the top on Stack Exchange), and when I asked him about triggers he said that he doesn't use them for updating data, ever, unless it's someone else's totally broken schema and he has to dig himself out of a hole.
On the other hand, he loves stored procs, and pretty much every interaction he lets apps have with the DB runs through stored procs. What are your thoughts on how heavily you should lean on stored procs?
I find inheritance vs. composition to be much less an issue in RDBMSes vs. OOP languages. The problem with inheritance in OOP is that you inherit the code, but not the data (at least, it is supposed to be hidden). This results in overly tight coupling between your class and the other class that you just need a piece of and (I find) is almost never what you want (rather preferring composition and interfaces).
Whereas in an RDBMS, there is no "code" attached to a relation; and the data is explicitly exposed. Inheritance starts to look more like the typical pattern of structure with common fields at the top level, and disjoint fields inside a discriminated union. You can mirror this with RDBMS composition (foreign keys), but you lose the ability to enforce even basic structural constraints (tricks like mutually-referential foreign keys notwithstanding).
Constraint triggers are very useful for enforcing cross-table constraints.
Triggers themselves, beside the typical use cases of logging and notifying channels, I find useful for non-disruptive schema upgrades. You can redirect updates, or fill in non-trivial default values, using them. (Rules could probably fill the same role.)
> Triggers themselves, beside the typical use cases of logging and notifying channels, I find useful for non-disruptive schema upgrades. You can redirect updates, or fill in non-trivial default values, using them. (Rules could probably fill the same role.)
BEFORE triggers are nice for constraints since they should not amplify writes. But even there, they can easily be overlooked by application developers.
Inverted indices are the underpinning of all full-text search engines. There are different ways to do them, but ultimately you do full-text indexing by loading the individual terms in the content into an index that references the document - hence "inverted" from the normal sense of indexing the document to find the content.
It seems to me then that with a carefully thought out inverted index and using the trigrams mentioned above, you could go a long way towards having the capability of the Lucene-based implementations like Elasticsearch and Solr.
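The built-in version of that is just a tsvector expression plus a GIN index, something like this (table and column names invented):

```sql
-- Expression GIN index: the index keys are the individual lexemes ("inverted").
CREATE INDEX docs_fts_idx ON docs
  USING gin (to_tsvector('english', body));

SELECT id
FROM docs
WHERE to_tsvector('english', body) @@ plainto_tsquery('english', 'full text search');
```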
A lot of these things are neat. However, they are sharp tools with specific uses that you should be careful with and understand well before employing them.
We would (we don't even use an ORM but find we have some relations that would benefit from inheritance), but lack of FK support kills it. Instead we either break out common fields to a separate table (which makes it difficult to enforce row constraints), or put all fields in a single table and make them NULLable (sometimes with a constraint to ensure the NULLs match expected usage).