Hacker News new | comments | ask | show | jobs | submit login
Postgres data types you should consider using (citusdata.com)
112 points by okket 5 months ago | hide | past | web | favorite | 37 comments

Another cool thing about Postgres is each table and view you make becomes a type.

Suppose you have a table of fruits:

  create table fruits (
      name text primary key,
      color text references colors,
      season text references seasons
(There are other tables at play, colors and seasons). Anyway, now if you make a function to find you some fruits in season:

  create function get_fruits_in_season(season text)
  returns fruits as
  'select * from fruits where season = $1'
  language sql;
Do you see that I just made a function whose return type is "fruits"? In this example, it's a bit overkill, because all you would really probably want is the name, and so would return "text." But it has been useful for me in real-life examples that are too complicated right now for me to spell out.

Documentation: https://www.postgresql.org/docs/current/static/sql-createfun...

Great tip, minor point, did you mean 'returns setof fruits'? Otherwise your function will only return one row. Opinions vary but i think its another good argument for singular named tables as well: 'returns fruit' and 'returns setof fruit' reads better to my eyes.

Just to emphasize the parent's point about singular vs plural, there are a lot of reasons to go with singular:

* Singular is much friendlier to find/replace, due to the inconsistency of pluralization.

* A good practice is to name association tables by concatenation of the tables being associated (when the association doesn't correspond to a previously named entity). teacher_class is a much better name than teachers_classes, and definitely avoid teacher_classes.

* As you move up the stack, structs, interfaces and classes have strong singular conventions. If you spend much time at higher levels, the habits you develop can lead to little bugs when working deeper in the stack.

Sorry, why yes, you're right!

  create function get_fruits_in_season(season text)
  returns setof fruits as
  'select * from fruits where season = $1'
  language sql;

This is the same as table-valued functions in MS SQL Server, right?

A nifty thing we use in our current project is the ltree extension[1]. It allows representing hierarchical relations and making hierarchical queries in a regular flat table structure.

[1] https://www.postgresql.org/docs/10/static/ltree.html

Love LTREE! Used it for domain label hierarchies (multiple roots - an "ontology") for a data science team to be able to train many different models with shared/overlapping label sets (an ensemble of models for various domains). It makes queries to get training data powerful since you can ask for "all descendants of X" or "ancestors of Y where Z".

Nifty indeed -- that may very well invalidate some old knowledge I have about manually representing trees in relational DBs, supported by recursive CTEs...

Didn't know that one. Powerful. Thanks!

Never heard of that, thanks.

The reason I don't use SQL enums is that I often can't tell exactly, how many and what values there will be when designing the database schema. Over a period of a few months an "enum" can grow from three to a dozen values, and writing an SQL migration each time is kind of a PITA, to say nothing about merge conflicts. In comparison, writing a validation on the application side is easy.

Am I missing something?

I have long wanted to use an enum but never have, despite using Postgres for more than a decade.

   > In comparison, writing a validation on the application side is easy
But then you went there. I was going to say, instead of enums I always end up just make a table of the choices, because invariably I will want to store one or more columns of information about the choice. Anyway, a foreign key, as opposed to an enum.

Writing validation on the application side is easy up front, but over the years I have been moving more and more smarts to the database:

- it's often less typing in SQL than your procedural language of choice

- some constraints are easy in a database but 1,000 times harder in a procedural middle layer (uniqueness, ACID transactions, etc.)

- it's more efficient, because it saves the transit of data that your application-layer validation often needs from the database to do its own thing anyway

I think you forgot one of the more important points against application-side validation: any future applications would have to also implement the validation or your data potentially corrupts. Applications that also include possible meatbags hand-modifying values for some "special" scenario (CEO order, emergency situation, etc)

The DB is a gigantic bag of data: if its got anything useful in it, then everyone will eventually want to put their grubby little hands all over it

In my experience, the key tenet for good database architecture is to assume that someone else is going to write a program, without your input, that's going to shove data into your DB's tables, without the involvement of your application layer. If the data that ends up in the database is wrong in such cases, then your application will be left holding the bag when it sees that data.

Much better to just tell your DBMS the rules you want enforced, such that when that ne'er-do-well (which might be, say, future you) attempts to insert crazy data, it just doesn't work.

Yes, it's a relational database so use relations to express things!

In my experience one of those "associated columns of information about the choice" is frequently an end-date, beyond which the choice is no longer available (so that old records can continue to be associated with old then-valid choices, but new records cannot).

When a database is just functioning as your (relatively monolithic) application's data store, validation on the app side works well. The app is the primary—maybe only—product.

But if you have multiple apps sharing the same data, or the app is a bunch of different things from many years glued together, etc., then there is no longer "the" application side. You'd have to write validation on multiple application sides, and make sure they all agree. The data is at this point often more important than a single app.

Even with single-app data stores, if you ever have to use the SQL prompt directly (say, to do a one-off change that it doesn't make sense to write an app feature for, or time doesn't permit doing so) it's nice to have all the constraints specified in the database.

100% agree with this here.

I have had to push for leveraging postgres for data validation (including typing columns with enums where applicable), and sometimes it's hard to convince people, but all you have to do is (a) poke around your existing data and find cases where you have junk because of some out-of-band process that created data without using your main app's validation, and (b) do a couple out-of-band things like an ETL or two and see how it saves you from creating junk.

I don't know where I picked this adage up, but it's also something I constantly think about:

Your data will outlive your application code.

IME, if you value the data's integrity, it's not really a question whether you should be pushing validations down to the db layer.

Some times you know a value won't change that frequently. We use them for things like AWS regions we support. Sure AWS may add a few new regions a year, even if we supported all regions though we'd only be at 18. The validation on the application side assumes you never go around that validation, and the reality of data is that it's very possible to do that without a constraint on the database. Of course you can go too far with them when they don't make sense, but they're at least worth considering.

Not saying that you are doing something wrong, but the AWS case kinda sounds like it will eventually grow to be a table of its own to me. Tomorrow the manager/architect/client comes and tells you that you need to associate some data with the region, and boom, you have a new table and a new foreign key constraint.

Again, not saying you're doing something wrong, just sharing a thought.

I don't have much experience using them but it seems to me like migrating from an enum to a one-column table would be generally trivial: the constraints are effectively the same between an FK and the enum, so there can't really be any data to be cleaned up, and you'd still only use it for data validation, so there's no queries that need rewriting afaict.

If its becoming a new table because you now want to associate metadata with it, then you'll have to rewrite queries in either case: previously it was just for data validation (so only relevant on insert, and never really referenced); now information is required so queries will have to join to it where they weren't before.

And if the enum existed in multiple places, and now you have to go around hunting them down to construct the FKs.. then it probably should have just been a table in the first place (otherwise any change to the enum requires finding and syncing each instance of it)

If it'll eventually want to become its own table, then at least while its still small we can claim the benefit of not having to name the thing. And that alone seems sufficiently beneficial to accept having to convert to a table later (unless, ofc, you expect to have to convert by like, tomorrow)

> Am I missing something?

Yes, you are using your company's practices to discount the general benefit of typed data stores. You are also trading safety and usefulness at use/read time for ease at write time which is almost never correct. If your schema mutates frequently, you need to cope. Fearing schema alterations, because of your migration setup is bad IMO (there are plenty of other reasons to fear them though of course). This is how schemas become stale and data analysts (or just other devs) have no clue what they're doing with a schema because developers short-sightedly refuse to improve/refactor schema because of business processes.

I struggled with this dilemma for a long time. Have tried three approaches.

In one approach, enum columns were integer columns in db & we relied on ActiveRecord's (RoR) enum definition to map string values their integer representation. This created issues for our (separate & independent) internal application which was feeding off the read-replica.

In second approach, enums were defined as db-types. This solved most of issues seen with the first approach, but again sql-migration for maintaining the enum-values was another pain-point. Though it was not as painful as i thought, as the enums didn't change very frequently.

In third approach, used a separate table to maintain enum definition. If the column has a wider presence then, additional join might prove relatively costly compared to second approach.

Currently, we follow a mix of last two approaches.

Uh, what part is hard?

It's literally 3 lines of work, if you use a tool that does most of the work for you. [Here](https://github.com/fake-name/ReadableWebProxy/blob/510cc41ae...) is a migration I have for one of my hobby projects.

Tthe process is

`alembic revision --autogenerate` Add two lines to the migration skeleton (4 if you care about downgrades). `alembic migrate +1`.

I may have the alembic migration commands slightly wrong. It's been a while since I've done them without having bash scrollback handy.

Additionally, adding an enum value cannot be run in a transaction.

And I use dbmate which puts all migrations in transactions.

For those (like me) who are wary of enums, it's very simple to create a tiny table with one column of distinct string values. Future-proof and serves the same function if you add not-null foreign key constraints to your dependent tables.

I may be mistaken, but I believe under the hood Postgres enums work in a very similar way to this, only the strings are interred instead of duplicated across your tables.

Yes, that's true. I'm paying the small duplication cost in exchange for removing enums' cumbersome limitations. Not saying I never use them, but I usually end up regretting it when I do :)

Personally, I'm on the fence with enum types... I've just seen too many times when someone thought an enumeration was final and then later it had to be modified or we needed alternative ways to reference it other than a plain string conversion.

Uh..... `ALTER TYPE enum ADD VALUE 'new_value';`?

The only potentially problematic thing is the fact that it's not transactional.

It’s easy enough to add items into enums.

But I agree - removing or renaming - safely - is an arse since you have to create a new enum, add it to every table with the old enum, update old to new column, drop enum, drop old columns, rename new columns to the original... yikes!

Yes it can at least be scripted but then all that schema rewriting is a pain on large tables.

Yeah, and you can't alter them in a transaction which makes things a bit tricky.

I've had misgivings also, but mine related to how the application would read in the enums versus a simple reference table the app can do a select once and prep the data or we could generate code for the same.

that's why I was surprised to see enum on that list - the rest made sense. my least favorite is when I've been called in because developers have assumed that enum's were mutable, and have started asking what all those error messages mean when they've tried to make changes.

HyperLogLog, also by citusdata (no affiliation), is a great data type for statistically approximating counts of distinct values very quickly within an adjustable range of error:


This has been a great secret weapon for me in a few big counting problems.

Ranges are interesting. I had an app once that dealt with biodiesel. Each gallon of biodiesel needs a serial number. I cannot remember all the rules, but you assign the numbers sequentially at the origin. So, a 10,000 gallon truck would have XXXXX00001 - XXXXX10000. When you offloaded into a bigger container it would have those numbers plus any ranges already in the tank. Loading would entail picking an amount from the ranges, reducing the range in the tank, and assigning it to the truck. We had routines for all this including consolidating ranges (two trucks dumping into a tank with sequential ranges would create a big range) or subtracting from ranges. A truck could end up with a collection of ranges. It was all arbitrary in a way and weird. I really hope they do it differently now.

I wonder how hard it would be to write that in PostgreSQL these days using the range types?

Applications are open for YC Summer 2019

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact