Common mistakes in PostgreSQL (postgresql.org)
1084 points by kawera on May 3, 2019 | hide | past | favorite | 253 comments

> varchar (without the (n)) or text are similar, but without the length limit. If you insert the same string into the three field types they will take up exactly the same amount of space, and you won't be able to measure any difference in performance.

Seriously? Every time I create a Django CharField I try to guess in my head the best length for the field to optimize for performance, but it sounds like using an unlimited-length field (TextField) doesn't make any difference. Good news

Yeah, I used to do the same thing. I had dozens of length-limited VARCHAR fields, and sometimes I forgot to change them when my application's validation logic changed; then we'd get errors from length limits being hit. It was a very liberating day when I realized I didn't need the length limits in the database and that they made no performance impact. I went through and removed them all promptly. Life has been better since.

Does PostgreSQL have index size limits? I know in SQL Server one thing that can bite you when using varchar(max) is that if you later want to index it, it can fail if anything is over 900 bytes, as that is the index size limit.

It's a run-time error too, so the index may get created with no problem; then later someone tries to insert a value over 900 bytes and it throws.

Even if it doesn't have index size limits, I can't imagine that it doesn't have row size limits and some kind of off-row storage or multi-page storage with resulting overhead.

Looking at the doc[0], I see this:

> PostgreSQL uses a fixed page size (commonly 8 kB), and does not allow tuples to span multiple pages. Therefore, it is not possible to store very large field values directly. To overcome this limitation, large field values are compressed and/or broken up into multiple physical rows. This happens transparently to the user, with only small impact on most of the backend code.

So, yeah, large values are stored differently and there's an impact on performance.

Sorry, a blanket statement like "you should use varchar with no length limit" seems extremely ill advised to me. Like, sure, you don't want to use varchar(30) for last names. But varchar(255) is probably going to do you just fine.

[0]: https://www.postgresql.org/docs/current/storage-toast.html

IIRC you can still create the index; it's just that you can't do exact matching if the content is over the length limit, though it can still do begins-with matching.

That said, I still don't get why they didn't just change the behavior of TEXT/NTEXT instead of adding the (MAX) options.

edit: I am mistaken; once the index is created, the index maximum becomes the insert limit (from the response below).

If you're referring to SQL Server, on insert you get: "Operation failed. The index entry of length 901 bytes for the index 'xyz' exceeds the maximum length of 900 bytes."

Thanks. It's been a long while since I've dealt with the situation.

roughly 2700 bytes, from a bit of googling.

(off the top of my head, the most likely use case for 2+kb text fields would be things you would be using pg's full text index on anyway, not the usual btree type indexes as you'd want to speed up a join or whatever.)
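For reference, a full-text index in Postgres is typically a GIN index over a tsvector expression. A minimal sketch (the table `docs` and column `body` are invented for illustration):

```sql
-- Expression index for full-text search over a long text column:
CREATE INDEX docs_body_fts ON docs
    USING gin (to_tsvector('english', body));

-- Queries must use the same expression for the index to be usable:
SELECT *
FROM docs
WHERE to_tsvector('english', body) @@ to_tsquery('english', 'postgres & toast');
```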

I agree, but sometimes it's not something you think about, and when everyone says to just use varchar(max)/text, later on you want to speed it up with a btree: oops. IMO it's better to set the length if the column might be indexed, to let everyone know ahead of time not to go over.

I am not that familiar with PG's full-text index types, but in SQL Server the full-text index doesn't really work well with exact/begins-with searches the way the b-tree does, so you're kind of stuck in that situation.

With SQL Server I like to stick with varchar(255), as it gives headroom for Unicode indexing, and at one point there were some internal optimizations for lengths <= 255 due to the legacy maximum being 255; not sure anymore.

Beyond 255 it's much less likely you'll need a btree, and it becomes more of a full-text issue.

Last time I looked it is limited to the page size (so about 4kb on the boxes I played with). If you try and put more than that into an indexed field it throws an error.

I did not know about this limitation in SQL Server, thanks!

Until a troll uploads the Bee Movie script as a first name.

Use `CHECK (length(the_column) < some_maximum)`. It has the benefit that you can change it without having to redefine views.
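A minimal sketch of that pattern (table and constraint names invented); note that the CHECK expression needs its own parentheses:

```sql
CREATE TABLE users (
    id      bigserial PRIMARY KEY,
    surname text NOT NULL
        CONSTRAINT surname_len CHECK (length(surname) < 1000)
);

-- Changing the limit later is a drop-and-add; the column type never changes,
-- so dependent views don't have to be touched:
ALTER TABLE users DROP CONSTRAINT surname_len;
ALTER TABLE users ADD CONSTRAINT surname_len CHECK (length(surname) < 5000);
```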

Whoa, that's my go-to; I had no idea I wasn't the only one doing this.

The Bee Movie is truly the lorem ipsum of tomorrow.

... or the Bee Movie as a base64 encoded string.

But every time there's a "b" in the output string, the lyrics to All Star are added in plaintext.

we call this beecryption


... of a scrypt

There are other advantages of setting (large) limits that are appropriate to your data model, at least if you take data from untrusted users. You'd be surprised by the kinds of things users will put into truly unlimited fields... But yeah, there's no reason to try to figure out whether 32 characters or 64 characters would be better. But a title field probably never needs to be bigger than a couple thousand characters, for instance.

That should be rejected at the API level already. It's possible to do that at the DB layer as well, but who's to say that there isn't another class of users that has higher limits.

To the first part: sure, if you're confident a single API service will be the only thing inserting into the DB, then you can skip the DB level limit. I think you may as well add it, since you've already put thought into what you want the API-level limit to be, so why not stick that same value in one more place.

To the second part: that way lies madness. It's true that you don't know all your future use cases, but it isn't a good reason to leave things unrestricted. It is much easier to increase the limit than to lower it (because you are likely to be deleting data if you lower a limit, which is a tough pill to swallow).

Totally agree; especially at larger companies there are integrations to ERPs, data warehouses, partners, etc., and then it's often a good idea to have communicated lengths for your domain data (title, email, ...).

In case more than one service writes to the same table, either both should be under the control of the same organization, or a stored procedure layer should be introduced for the external org (there will probably be more things to check than just the limit). The phrase "untrusted users" is concerning in GP's post.

For users of older Postgres deployments (<9.2) it's worth noting that changing the limit on a varchar is not free: it triggers a table rewrite.

With Postgres, you can have code running on the user's machine access the database directly, either via postgrest, or even via an old-fashioned database driver. This is not a problem because Postgres has a very fine-grained rights management and supports row-level security, preventing the end user from accessing any data they are not authorized to.

Of course, you mention the stored procedure layer that is required to make this work, so you're probably already aware of this. In that case, remember that you might not need views over literally every table in the database, and even in the case where you do use a view or stored procedure, it might still be much more convenient to do the length limit check on the table rather than cluttering every stored procedure that writes to the table with the same length limit check.
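A minimal row-level security sketch of the kind of setup described (the role `app_user` and table name are invented; it assumes each end user connects as their own database role):

```sql
CREATE TABLE documents (
    owner text NOT NULL DEFAULT current_user,
    body  text
);

ALTER TABLE documents ENABLE ROW LEVEL SECURITY;

-- Users can only see and modify their own rows:
CREATE POLICY owner_only ON documents
    USING (owner = current_user);

GRANT SELECT, INSERT, UPDATE, DELETE ON documents TO app_user;
```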

Yeah, but what it states here is that if you want to add a limit on field size, you should add a check constraint... much easier to deal with than changing the type.

You're right, that seems better.

Coming from other database systems I found this intriguing, and did some googling. It turns out there are historical reasons that PG has the functionally overlapping text and varchar(n) datatypes: text was the original and still the best datatype, as detailed in this thread, whereas varchar was added for ANSI compatibility and does not enjoy the same system support (changing the length is a pain compared to other engines).

PG is going to be the engine of choice for my next project, so I'm happy to learn about this quirk compared to other engines. Which doesn't mean I'm all that happy about it. I'm really in agreement with this post from 2010 (still current AFAICT) that wishes the support for varchar(n) were better, for readability and cross-engine compatibility if nothing else [0]

[0] https://www.postgresonline.com/journal/archives/154-In-Defen...

For contrast, MS SQL Server has `NVARCHAR(MAX)`. It's always bugged me with MS-SQL that they added `NVARCHAR(MAX)` rather than just changing the behavior of `TEXT`.

It's because text and ntext are extremely old LOB types, and they support SQL statements like WRITETEXT and UPDATETEXT that other statements don't. They also can't be indexed, don't support functions like LEFT(), REPLACE(), etc. Text and ntext behave wildly differently than char and varchar while varchar(max) and nvarchar(max) are almost the same. Changing that functionality would cause problems. Further, neither text nor ntext are ISO standard data types.

It's the same reason that SQL Server has datetime2 and Oracle has varchar2 (the latter of which is only now being merged into varchar). That's why float, real, and money still exist even though numeric/decimal was created. They behave differently and in ways that are not exactly backwards compatible. In order to prevent your customers from having to rewrite their application just because you updated your data types, they chose to add new data types.

This is one of my complaints with ORMs these days: they’re starting to feel a lot less sophisticated than the databases, and data, that they sit above.

The thing with an abstraction layer is that it has to accommodate the simplest component.

In this case, not needing to specify the length on your varchars is an implementation detail of Postgres, but the ORM still has to deal with weaker components.

Also, for those interested: there's no difference between varchar and text in Postgres. I just use text fields now, since all length checking is done at the application level and Postgres is the only DB I need to support.

There are two ways to build abstractions: either by defining an interface that represents the lowest common denominator of functionality, such that all the underlying implementations already implement it; or by defining the interface that you want to have, where some implementations can implement it directly and others require a shim/polyfill implementation, or where the implementation “adapter” will just plain throw a NotImplementedError at your attempt to use the advanced features of the interface.

Personally, I prefer the latter. But it seems like in the ORM space the latter approach is nowhere to be found.

Have you ever seen an ORM that allows you to fluently deal with errors raised by triggers? DB-side computed columns [by storing to a table but querying from a related view]? Where the ORM’s “objects” expose an OOP interface for calling DB-side stored procedures that take that row-type as an input? (Or, even more crazily, where you could define a stored procedure as an abstract fluent AST against the object, and have it registered with the database?) Or how about exposing DB-side BLOB handles as native-to-your-runtime IO streams that could be fed into any runtime function expecting such?

There’s really a lot you could do, that’s still abstract to the SQL level (rather than specific to a particular database) if you were okay with some implementations of your interface not supporting every feature.

ORMs are great for quickly standing up a database, but if you need advanced features you should probably just use raw SQL. Most ORMs even make this very simple and safe.

Simple to say in a language with dynamic typing; but part of the reason I use an ORM (when I do, which isn't always) is to be able to guarantee that there is never a mismatch between DB types, and business-layer types.

Usually this means that I have a set of proxy types in the business layer that directly represent the DB types—table row types, table composite-primary-key types, view row types, stored procedure input-tuple and output-tuple types, etc.

Most ORMs support this sort of mirrored-schema modelling between the DB and business layer, with the ability to declare things like reference relationships where the ORM will throw an error if you get back a type you weren't expecting.

But when you drop to "raw SQL" in pretty much any ORM, you lose the ability to specify the business-layer-side input and output types for your expressions, and instead have to deal with sending and receiving the ORM's AST-like abstraction (usually plain arrays of scalars to map to bindings in the SQL statement.)

ORMs that provide a LINQ-like abstraction are a bit different, in that you can use SQL fragments within the fluent syntax to do things like call a stored procedure; and in those cases, they usually allow you to pass in typed arguments, and to specify the output type of the expression (for the rest of the fluent-syntax chain to use in interpreting it.)

While LINQ-like features are pretty nice, they still don't cover everything, because ORMs pretty much always execute their expressions within a transaction, and some SQL features (DDL statements specifically) can't be executed in a transaction. There are also certain types of SQL statements that most RDBMSes refuse to prepare with bindings—e.g. you can't make "which table you're querying from" a dynamic part of the query, even if you're willing to cast the result to a static "base" row type [as in PG's table inheritance.]

> Personally, I prefer the latter. But it seems like in the ORM space the latter approach is nowhere to be found.


Avoid like the plague.

At the risk of this being too far removed from the original subject, it'd be interesting to hear your reasoning behind this statement. I've used SQLAlchemy in some projects and did not find it troublesome, but that's just my experience, and I might not have pushed it enough.

We use it on a large monolithic application. It's too brittle. It abstracts away the benefits of the underlying database, all for the ability to support a handful of databases? Meh. The default queries, and the default mode of constructing queries, are so generic that it makes no sense to use them. I miss the scalpel that is crafting my own SQL queries: taking indexes into account, running and tuning queries before committing them to code.

Look, I was being hyperbolic. SQLAlchemy is great for prototyping, but then it's too easy to just keep building on that crutch.

I hate them. They take away all the benefits of the database and add only cognitive load. I guess the case for them could be made in the same way as asking why you would write assembly instead of C, but to me an ORM is to the database what Visual Basic is to C.

With Postgres, yes. Unless there's a real semantic need for a max length, you should always just use `text`.

Even then you should use text, just with a CHECK constraint. That's easier to change later than a varchar column.

Yes. The most complicated migrations I've ever done have involved increasing the length of varchar(n) columns, because every dependent view needed to be dropped and recreated (and all of their dependents, and so on, recursively) within the migration transaction, in the right sequence.

If I had just used the text datatype in the first place (along with a check constraint), these migrations would have been so much simpler.
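For anyone who hasn't hit this: the pain comes from Postgres refusing to alter the type of a column that any view mentions, so the migration looks roughly like this (names invented):

```sql
CREATE TABLE people (surname varchar(20));
CREATE VIEW people_v AS SELECT surname FROM people;

-- Fails with "cannot alter type of a column used by a view or rule":
-- ALTER TABLE people ALTER COLUMN surname TYPE varchar(50);

-- Every dependent view (recursively) must be dropped and recreated around it:
BEGIN;
DROP VIEW people_v;
ALTER TABLE people ALTER COLUMN surname TYPE varchar(50);
CREATE VIEW people_v AS SELECT surname FROM people;
COMMIT;
```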

You had views dependent on views? Object-oriented database schemas sound problematic.

Nothing to do with object orientation. We just use a lot of views, both to abstract away common complex queries, and to allow external systems to interact with the data in a controlled way.

That makes sense actually. But stored procedures could do that too?

Yes. I tend to treat procedural code as a last resort, because pure declarative SQL is so much easier to troubleshoot.

I guess I use stored procedures differently. I rely on SQL queries, not for loops, if I can help it. Sometimes it can't be avoided, for example if you're doing updates in chunks.

I think it's super weird that databases can have developer-hostile designs like this and still be considered "world class". If a web app framework had pitfalls like these, people would rageblog them into oblivion. I'm not trying to single out Postgres, because I have the impression that the same holds for all databases over a certain age.

I mean, most of this article reads as a list of features postgres never should've had. If PHP can deprecate magic quotes, then why can't postgres do something similar? Why do the "string data type" docs contain anything other than the "text" type, outside a section titled "deprecated, do not use"? Why isn't there a strict mode that warns me when I'm trying to use Bad Ideas From The Nineties?

Every single technology that lives long enough accumulates legacy features and antipatterns like these. Compared to other software of similar age, this is remarkably little cruft.

Every tool that's powerful enough has caveats. I agree it's a PITA, but this is orders of magnitude better than, say, bash, C++, Linux, or PHP.

Strict mode sounds like an awesome idea. From time to time people release database schema linting scripts, those might help?

They can't deprecate varchar and some other ill-advised field types because they are required for compliance with the SQL standard.

The text thing, though, has never been a secret. Learning about it is sort of a coming of age ritual for postgres developers.

Postgres' warts are nowhere near as unpredictable and problematic as magic quotes. Apart from those things, most RDBMSs are incredibly stable, ironed-out workhorses, and nowhere near as controversial as PHP.

Even better, make a domain type encapsulating the constraint.
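A sketch of the domain approach (names invented); naming the constraint explicitly makes it easy to swap out later:

```sql
CREATE DOMAIN login_name AS text
    CONSTRAINT login_name_len CHECK (length(VALUE) BETWEEN 1 AND 100);

CREATE TABLE accounts (
    login login_name NOT NULL UNIQUE
);

-- One change here updates the rule for every table using the domain:
ALTER DOMAIN login_name DROP CONSTRAINT login_name_len;
ALTER DOMAIN login_name ADD CONSTRAINT login_name_len
    CHECK (length(VALUE) BETWEEN 1 AND 200);
```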

I sometimes use a limit (a constraint or varchar(n)) to enforce an upper bound when the domain doesn't really demand one, but where you'd rather get an error than a longer string in your database. Say, 100 characters for a login name or 1000 characters for a real person's name.

Even if somebody actually shows up with a longer name, you know that your application will have other problems in that case (e.g. the UI explodes or something), but you prevent the occasional bug from writing megabytes into the name field and causing follow-up problems.

So, does this mean there is no longer a good reason for fixed-width text fields in any databases? Why not?

Surely if it's variable-width, that means you can't guarantee it will be on the same page? And therefore there's a performance cost?

Indeed, that is the reason character fields have traditionally had a length in RDBMSs. However, many of these optimizations are broken by Unicode support.

Consider a CHAR(1) field. Without Unicode support this field is always 1 byte, and therefore easy to store. With Unicode support, however, the value that fits in this field is actually between 1 and 4 bytes, while in many cases users will only store 1 byte in it. Storing this as a fixed-length 4-byte field will in many cases result in a 2-4X increase in size, and hence waste a lot of space. Thus even with CHAR(1) it is usually better to store the strings in a variable-length manner to avoid wasting space; and CHAR(1) is the absolute best case for this type of optimisation. This gets way worse with larger CHAR lengths.

It's also worth noting that even if we could store a fixed single byte, that is likely still not the best way of storing a CHAR(1) field. A CHAR(1) field is typically used to store types or categories. Such a column usually has few unique values (~2-8), which means we can reduce the size of these values to 1-4 bits using dictionary compression instead of a full byte, a saving of 2-8X over storing a fixed one-byte length.

FWIW, I am guessing it's a rare case that you can be confident that all strings you'll want to store are of a specific length, but cannot predict the set of characters they may use.

The use cases for CHAR instead of VARCHAR that I'm thinking of would be things like serial numbers. CHAR(1) seems even more specific; the only time I've ever seen that used is for some version of enumerations with mnemonic keys.

> things like serial numbers

In Postgres, if I had a truly massive number of alphabetic-textual serial numbers (such that I was worried about their impact on storage/memory size), I’d define a new TYPE, similar to the UUID type, where such serial numbers would be parsed into regular uint64s (BIGINTs) on input, and generated back to their textual serial-number representation on output.

I don't think the problem is so much about predicting the encoding that will be used. Let's say you KNOW that non-ASCII characters will be used in the field, but only very occasionally. That means the field has to be 4x as wide as it would otherwise be, even though non-ASCII characters will almost never be used.

If you knew the characters were ASCII, you wouldn't want unicode support in the first place. It's pure overhead.

But you also wouldn't want char(n), because as the article notes:

char(n) doesn't reject values that are too short, it just silently pads them with spaces. So there's no actual benefit over using text with a constraint that checks for the exact length.

So even if you're using pure ASCII, you're supposedly better off just setting the character encoding appropriately and using a text or non-limited varchar field.

That said, I thought the main benefit of a char field (and its padding) was that you could use it in select instances where it was useful to have fixed-length records. E.g. if all the other fields of the table are fixed-length, every record can be fixed-length if you use a char(n). I'm not sure whether that's still a benefit in modern databases, though.
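The padding behavior mentioned above is easy to demonstrate in psql (a sketch; the `codes` table is invented):

```sql
-- char(5) silently space-pads values that are too short:
SELECT 'ab'::char(5) = 'ab   '::char(5);  -- true
-- ...and trailing pad spaces are ignored in comparisons as well:
SELECT 'ab'::char(5) = 'ab';              -- also true

-- The equivalent exact-length rule with text has none of these surprises:
CREATE TABLE codes (
    code text CHECK (length(code) = 5)
);
```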

Well, it depends.

Some RDBMSs don't store the string's length with each value for fixed-width character fields. So a fixed-length field may still be preferable if the space lost to padding amounts to less than the space lost to overhead.

That distinction doesn't apply to PostgreSQL, because it stores a length along with the value even for fixed-width columns.

The problem is that even with serial numbers, things aren't that fixed. You might have a different type of serial number, zip code, etc. A check constraint can be expanded to do the appropriate length and formatting checks for most types; a strict length restriction does just one simple thing.

Same with name and address fields. I've seen databases where they set it at a seemingly arbitrary number because that was the maximum length their delivery service could handle (UPS etc.), but then they added another one with different constraints…

The only performance cost is a worst-case cost. If your idea of performance is something like how many rows you can read/write/process per second, then the average length matters much more than the maximum observed length or the maximum permitted length.

As far as I know, it also depends on how the optimizer/planner works (at least in SQL Server).

If you have a known fixed width, you can be certain about the memory grants required for a specific query (width * row estimate); if you don't know that ahead of time, you might find you need to allocate more memory than is actually required.

This can be really bad for concurrency.

Postgres uses TOAST tables, an abstraction that allows PG rows to have unbounded lengths. Essentially, this works like a filesystem that supports extended attributes in both “small = inline” and “large = external extent” modes. When you have unbounded-length fields in your row-type definition, you end up with

• one “regular” PG table which has a row record type defined with a fixed length (like a filesystem inode), where all the static-length fields are held in the record itself, and any unbounded-length fields have a certain amount of space reserved for them in case they fit into that small space (if they’re even smaller, this space is just wasted);

• and one TOAST table—essentially a table of extents, that the unbounded-length fields of the previous record can, instead of holding data directly, hold pointers into.

TOAST for such columns only gets used when the whole tuple overflows a size threshold, which I believe, for current PG versions, is about 2 kB (TOAST_TUPLE_THRESHOLD). So any row type whose fields are bounded tightly enough that the tuple stays under that threshold (whether by using VARCHAR(n), CHAR(n), or a TEXT column with a CHECK (length(col) <= n) constraint) will never use TOAST, thus guaranteeing your stride width and memory bounds.

And, as well, any query that doesn't try to "unwrap" the values of those unbounded-length columns won't have to fetch from the table's TOAST table. So you can optimize for the memory bounds of specific queries on a table (by bounding the length of certain fields, keeping them out of TOAST) while leaving other fields as unbounded-length, if no hot-path query uses them.

(There's also an extra optimization where the values of an over-threshold tuple are first transparently compressed, to see if that will make the tuple fit; only if it is still too large are values moved out-of-line. Thus, although you need the length check to guarantee that you won't be using TOAST, it may turn out that your table "gets away with" having some values a fair bit larger than the bound without ever storing them out-of-line.)
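If you want to poke at this yourself, the catalogs expose it (a sketch; `my_table` and `big_text_column` are placeholders):

```sql
-- Which TOAST table (if any) backs a given table:
SELECT reltoastrelid::regclass
FROM pg_class
WHERE relname = 'my_table';

-- On-disk size of a value after any compression / out-of-line storage:
SELECT pg_column_size(big_text_column)
FROM my_table
LIMIT 1;
```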

Excellent answer - thank you. I did some googling, and if you have any specific resources I would love to read more (I was just reading https://www.postgresql.org/docs/11/planner-optimizer.html)

A question: if the planner knows it's joining to the TOAST table, and it has some predicate to evaluate, how does it know the widths of all the rows so it doesn't over-grant (or are memory grants not a thing?)

Generally with statistics/histograms you get a sample and a max width (or something like that), so I am just confused about how it can be precise.


After many days of googling, it looks like Postgres doesn't have this problem because it also can't use dynamic memory grants to get better performance on a per-query basis the way SQL Server does.

So while you never suffer from "under"-resourced queries, you generally don't get to see what they would look like if they had more memory.

In the SQL Server world that memory grant is part of the query plan, so that's why the question occurred to me.

Happily, the PG planner isn't thrown off by such things, at least in my experience. It has sample rows and uses them. (I really like the postgres planner, it's the best I've used.)

I've seen it be FAR wrong about memory, though, despite being right about column widths: When you join a few tables in a big select, the planner will look at some sample rows and guess the number of rows in each step of the possible plan, and if the sample is too small, that estimate can be wrong by a LARGE factor.

Use a CHECK constraint to limit the length. Changing varchar length doesn't play well with dependent views.

>Seriously? Every time I create a Django CharField I try to guess in my head the best length for the field to optimize for performance, but it sounds like using an unlimited-length field (TextField) doesn't make any difference. Good news

Pretty much, except that if you're using ModelForms you'll end up rendering a different element by default, and the admin (if you use it) will be different. Not really a problem with DRF if you skip the browsable API.

You can change the element for a field via `widgets` in the form's Meta class.

That's strange. In SQL Server, the size of the text columns impacts performance [0] by changing the amount of memory that is allocated to run queries.

[0] - https://sqlperformance.com/2017/06/sql-plan/performance-myth...

As the article points out, that's not normally the case in SQL Server, it's that there are certain steps like sorting that require blocks of preallocated memory.

SQL Server is probably like that because varchar(max) is relatively recent; I'd expect that issue to go away as their support for it improves.

> I try to guess in my head the best length for the field to optimize for performance, but it sounds like using an unlimited-length field (TextField) doesn't make any difference.

That used to be a good idea. I guess it's gone the way of defragging, or "parking" your hard drive.

CharField in Django is varchar, so while there is no performance difference vs. TextField, it does put limits on your inputs. This is most likely a good security measure; unbounded inputs can be found and exploited.

I’m pretty sure Django doesn’t actually create a limited char field that way, it merely adds validation. For exactly this reason.

It does pass the limit down. I've just verified my schema that has a varchar(128).

One of the reasons given in the bug for it being difficult to fix is that, down in the depths, it assumes it is a varchar(), with parentheses.

Here's the relevant comment:


I wonder if the same holds true for SQL Server? Data truncation errors bite us in the ass all the time where I'm at.

> If what you really need is a text field with an length limit then varchar(n) is great, but if you pick an arbitrary length and choose varchar(20) for a surname field you're risking production errors in the future when Hubert Blaine Wolfe­schlegel­stein­hausen­berger­dorff signs up for your service.



UI devs HATE him!

I'd have Hubert over Bobby Tables any day!

Funny how an article popped up on HN last week that mentioned table inheritance in a list of obscure Postgres features you should try out.

If I'd read this first, I wouldn't have wasted Saturday finding out for myself why you shouldn't use it.

... or if you had read the HN comments on that article: https://news.ycombinator.com/item?id=19768864 :-)

If they supported foreign keys, I’m pretty sure I’d have used it many times. It seems like a good way to efficiently represent multiple distinct types which have some shared columns and some non-shared columns.

I recently used table inheritance to great effect on a project, so I wouldn't write it off completely. My use-case was a bit different as it involved zero-downtime data imports from separate sources (concurrently), but it proved a great tool for the job.

I find how it's dismissed as a fad pretty humorous and actually a bit insightful.

>Table inheritance was a part of a fad wherein the database was closely coupled to object-oriented code. It turned out that coupling things that closely didn't actually produce the desired results.

"Don't use CURRENT_TIME. Never"

Honest question: if we should never use it, why have it as an option in the first place? Is it some type of backward-compatibility issue? This list is great, btw. Why we shouldn't do something is a great way to learn stuff.

They give the reason on the previous entry (about timetz):

> Why not?

> Even the manual tells you it's only implemented for SQL compliance.

lol, can't they implement a strict mode?

Do they? This doesn’t sound like a bad idea. Strict mode disabled = compatibility features on; strict mode enabled = only ‘safe’ options enabled.

I don't know, but somehow HN hates this idea?!

I suspect you're being downvoted not because of your idea, but because of how you're presenting it. Comments that contain "lol" and a sarcastic remark or rhetorical question are considered immature and aren't well accepted here.

It explains why: because it returns timetz, which is only kept for compatibility with the SQL specification but is a bad way to store time.

I have the same question for things such as

> Don't use psql -W or psql --password

If I do a psql --help I see

  Connection options:
    -W, --password  force password prompt (should happen automatically)

Nothing tells me I shouldn’t use it.

The documentation even states:

> This option is never essential, since psql will automatically prompt for a password if the server demands password authentication. However, psql will waste a connection attempt finding out that the server wants a password. In some cases it is worth typing -W to avoid the extra connection attempt.

So this is definitely confusing.

What they probably mean is that you shouldn't do something like --password=mypassword, which will work but is obviously dangerous to do.

No, they say: "If you're connecting with -W to a server configured to allow you access via peer authentication you may think that it's requiring a password when it really isn't. And if the user you're logging in as doesn't have a password set or you enter the wrong password at the prompt you'll still be logged in and think you have the right password - but you won't be able to log in from other clients (that connect via localhost) or when logged in as other users."

Don't do that either, but -W will always ask for a password even when one is not required. Got a keylogger or someone watching over your shoulder, and they now have your password even if the server is not actually asking for it. All that insecurity just to, potentially, save a roundtrip to the server.

> Got a keylogger or someone watching over your shoulder, and they now have your password even if the server is not actually asking for it.

If this happens you've lost already, -W or not.

Why would I ever not want the server to require a password, though? Like, if your server ain't asking for a password at all, then why even bother with the keylogger?

Peer authentication for local postgres servers / poolers (authenticate using the OS users, doesn't work remotely, and yes, it's safe), kerberos / [gs]sspi authentication, client certificates.

Ah, right, forgot about those options.

Still, if you know that you're connecting to a server that's set up with password-based auth, then what would be the point of waiting? You'll inevitably have to provide a password at some point anyway, so the keylogger argument seems silly.

And if someone has the ability to install a keylogger on a machine that uses peer or certificate auth, then that someone almost certainly already has the ability to just connect directly using that same peer or certificate auth.

Postgresql supports authentication via TLS client certificate, which requires no password. Even then, a keylogger might be able to catch other secrets.

To me, "should happen automatically" means "program will take care of this, specify this option only when you suspect it is broken or non-optimal"

... which is what the original doc says

Postgres could give a warning message for this (and other no-no's)

Money is always funny to represent. I did a bunch of research on this for my budgeting app. It turns out that while most countries use two decimal places for their cent equivalent, some use 0, 1, or 3. In addition, at least one country (Sweden, I think) doesn't allow values like 0.01: their smallest denomination is 0.05, and everything is a multiple of it.

I ended up storing the data as a decimal type with 3 digits after the decimal point, along with the currency. Then I periodically extract the locale data from a Linux timezones package, and the app has a giant lookup table for how to represent and parse each currency + value combo. This changes from time to time as currencies get updated and governments change.
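
A minimal sketch of that per-currency lookup in Python (the minor-unit table here is a hypothetical excerpt, not the real locale data, and the function name is made up):

```python
from decimal import Decimal, ROUND_HALF_UP

# Hypothetical excerpt of the lookup table described above: number of
# decimal places per currency (most use 2, but 0, 1, and 3 all occur).
MINOR_UNITS = {"USD": 2, "JPY": 0, "BHD": 3, "MRU": 1}

def display_amount(amount: Decimal, currency: str) -> Decimal:
    # Round a stored high-precision value to the currency's minor unit.
    exponent = Decimal(1).scaleb(-MINOR_UNITS[currency])
    return amount.quantize(exponent, rounding=ROUND_HALF_UP)
```

So the same stored value renders as 3.14 in USD, 3 in JPY, and 3.142 in BHD.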

> In addition, at least one country (Sweden I think), doesn't allow values like 0.01.

There are a few countries where they no longer bother with small coins to represent .01, so cash transactions are rounded to the nearest .05 or similar, but .01 is still valid for electronic transactions. I expect that is the situation in Sweden. (I wish the US did this too.)

Canada did this. No more $0.01 coins, $0.05 is the smallest denomination.

Electronic transactions, which account for over 50% of transactions in Canada, still use pennies though.

India got rid of 0.1, 0.25 and 0.5. Cash transactions are rounded to the rupee.

And Australia. We got rid of 1c and 2c coins back in the 90s.

But electronic transactions are still processed to the cent.

The Haskell library safe-money does a great job of handling these weird cases for currencies. It has a separation between the types used for calculations and storage/presentation, and lets you define arbitrary ones so in your example you can handle computations in an arbitrary precision rational representation and then store it in cents and later convert with proper rounding to 5 cent increments if necessary. Much of this is handled by the type system, including type safe currency conversion.

All electronic transactions handle the SEK cent (1/100 SEK aka öre) though there are no longer any valid SEK cent coins. The 50 cent coin was taken out of circulation September 30th 2010 and nowadays all remaining cent coins are considered scrap metal.

There's (sadly) no rule that everything must be a denomination of 0.05 SEK or even 1 SEK. Pretty much every shop continues to use cents in their pricing (like .50, .95, and .99). If you pay in cash you'll end up rounding to nearest SEK (rounding up if it's below 1 SEK). You could potentially save money by paying with cash if your transaction would be rounded down... but at a maximum of 0.49 SEK (~0.05 USD) saved per transaction you'd need some unique spending habits to actually utilize this "trick" efficiently.

Source: https://www.riksbank.se/en-gb/payments--cash/notes--coins/co... and I happen to be Swedish.
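
If you did need to compute the cash total, the rounding rule described above could be sketched like this in Python (this is my reading of the rule, and the function name is made up):

```python
from decimal import Decimal, ROUND_HALF_UP

def cash_total_sek(amount: Decimal) -> Decimal:
    # Swedish cash rounding as described above: round to the nearest
    # whole SEK, rounding up to 1 SEK for positive amounts below 1.
    rounded = amount.quantize(Decimal("1"), rounding=ROUND_HALF_UP)
    if amount > 0 and rounded < 1:
        rounded = Decimal(1)
    return rounded
```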

As someone who does a lot of work in payroll (in eastern europe...) the swedish system sounds like a dream.


Switzerland only has 0.05 as a minimum denomination too, might be the one you were thinking about.

I don't get the rules on "timestamptz".

Normally I would want to enforce the use of UTC for points in time (as opposed to user entry) across backend code and APIs. If the column name is "eventtime"... then it is (of course) UTC...

I guess one could use timestamptz with a constraint saying that timezone is UTC, but is there a point? Timestamps in DBs should never be non-UTC...

timestamptz does store the timestamps in UTC. The difference between timestamptz and timestamp is that the former understands that the time value represents a particular instant in time, whereas timestamp does not: the value of a timestamp field is essentially opaque, because those values can be in any timezone and the system has no way of knowing.

    [postgres] # create table test(ts timestamp);
    Time: 29.807 ms

    [postgres] # insert into test values('2019-05-03 17:51:00');
    INSERT 0 1
    Time: 6.637 ms

    [postgres] # select * from test;
     2019-05-03 17:51:00
    (1 row)

    [postgres] # select ts at time zone 'utc' from test;
     2019-05-03 20:51:00+03
    (1 row)
    Time: 17.506 ms

    [postgres] # select cast(ts as timestamptz)  from test;
     2019-05-03 17:51:00+03
    (1 row)
    Time: 0.428 ms

    [postgres] # select cast(ts as timestamptz) at time zone 'utc' from test;
     2019-05-03 14:51:00
    (1 row)
    Time: 0.329 ms

    [postgres] # alter table test alter column ts type timestamptz;
    Time: 14.477 ms

    [postgres] # select * from test;
     2019-05-03 17:51:00+03
    (1 row)
    Time: 0.399 ms

    [postgres] # set timezone = 'America/Los_Angeles';
    Time: 9.281 ms
    [postgres] # select * from test;
     2019-05-03 07:51:00-07
    (1 row)
    Time: 0.364 ms
All that said, I think it's unfortunate that timestamptz stores values as UTC and I think it would be better for it to store the actual timezone for reasons others have already mentioned in this thread.

about "never not UTC" (but otherwise unrelated to the OP):

There was an article recently, describing a very important, and also very common, "exception":

If you want to schedule/synchronize real-world appointments/events, you need both the _local time_ and the political zone, like 2018-12-11 9am at Europe/London.

The reason is that between the time of entry and the time of the event, the defined timezone for that area may change. As a result, your previously translated appointment would no longer match up with the local time.

Which is bad, because humans, when meeting, or events, when being announced, announce the local time. Not UTC. And they stick to the local time.
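
The scheme that article recommends can be sketched in Python (a sketch, assuming the wall-clock time and the IANA zone name are stored as two separate columns):

```python
from datetime import datetime
from zoneinfo import ZoneInfo

def resolve_to_utc(local_str: str, zone_name: str) -> datetime:
    # Attach the zone only at resolution time, so later tzdata updates
    # (e.g. a country changing its rules) are picked up automatically.
    naive = datetime.strptime(local_str, "%Y-%m-%d %H:%M")
    return naive.replace(tzinfo=ZoneInfo(zone_name)).astimezone(ZoneInfo("UTC"))
```

The same stored pair ("09:00", "Europe/London") resolves to a different UTC instant in winter (GMT) than in summer (BST), which is exactly the behavior you want for appointments.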

I remember that post. It's not a remote chance. Every EU country will have to choose whether to stay forever in DST or forever in standard time, probably in 2021. Some of them will likely change time zone as a result.

Interesting. Having trouble with one minor point. When a local datetime (in a known timezone) is converted to UTC, won't it know the proper DST setting for that datetime? Seems like it should still get the right answer, with the even smaller exception that a law might be passed to change DST in the meantime.

I know and that is why I said "as opposed to user entry"

My understanding is that timestamptz basically does what you want, but better. It takes a timestamp in the timezone set in the client, and automatically converts it to UTC for storage, and any other client gets it back in their local time, based on the timezone set for their connection or session.

It basically lets you fold all the logic of converting datetimes into UTC and back out of UTC to local time for display into the DB itself. Worst case, if you want UTC and to do it in the application logic, just set your connection timezone to UTC, and you'll be expected to provide and get back UTC times.

> I guess one could use timestamptz with a constraint saying that timezone is UTC, but is there a point?

There's no need for constraints, that's exactly what timestamptz does. If it's provided a zoned datetime it'll convert that to UTC then store it, if provided a non-zoned datetime it'll assume that's in the local tz ("timezone" system setting) and performs the same conversion.

timestamp, by comparison, will just discard any timezone information (if it's given a zoned timestamp it'll simply discard the timezone information and store the timestamp's value as-is).

> Timestamps in DBs should never be non-UTC…

That only works for past events. Storing future events in UTC is usually broken, because future events are normally relative to a specific timezone (e.g. a meeting at 6 in Tulsa, or a party at 8 in Melbourne). If you store them in UTC and the timezone's offsets change, your occurrence is now incorrect.

> Timestamps in DBs should never be non-UTC

I think you might want to consider why some people might actually care to store timezone data, and why it almost always is needed. Timezone data is separate information that if not included, is now missing from your data forever unless someone goes and backfills it in. Without the explicit timezone stored, your timestamp could technically be in any timezone as far as the database is concerned.

> I guess one could use timestamptz with a constraint saying that timezone is UTC, but is there a point?

The point is you now have a data type that can properly use the built-in functions to more easily run calculations against proper timestamps that do have timezone sensitive calculations.

> Timezone data is separate information that if not included, is now missing from your data forever unless someone goes and backfills it in. Without the explicit timezone stored, your timestamp could technically be in any timezone as far as the database is concerned.

As the article points out, confusingly, problematically, and despite its name, "TIMESTAMP WITH TIMEZONE" doesn't, in fact, store timezones.

What it does is implicitly convert between UTC (the storage) and zoned timestamps (anything you give it, either explicitly zoned or, if non-zoned, assumed to be in the server's "timezone" setting).

Thankfully TIMESTAMP WITHOUT TIMEZONE doesn't store timezones either (how weird would that be), what it does is discard timezone information entirely on input, storing only the timestamp as-is: if you give it '1932-11-02 10:34:05+08' it's going to store '1932-11-02 10:34:05' having just stripped out the timezone information without doing any adjustment whereas timestamptz will store '1932-11-02 02:34:05', without maintaining any offset either but having adjusted to UTC:

    # select '1932-11-02 10:34:05+08'::timestamp, '1932-11-02 10:34:05+08'::timestamptz;
          timestamp      |      timestamptz       
     1932-11-02 10:34:05 | 1932-11-02 02:34:05+00
    (1 row)
    Time: 0.466 ms
Postgres has no built-in datatype to store an offset timestamp, let alone a properly zoned one. In fact, it has no datatype to store a timezone at all. Though it does provide pg_timezone_names.

Great resource! I have used PostgreSQL since, well, forever. And I didn't know the advice on text storage. Shame on me, but now I know.

Agreed! The ORM I'm currently using has been setting text for all my `string` columns by default and I was feeling the need to go fix a few, so this is one of those magical times when procrastinating on an issue paid off in my favour.

On that note though, I wonder what impact (if any) this has on an index covering the column? I could be off here but from my mysql days I was under the impression the index would be more efficient if there was a length limit on the column.

I feel like every programming-related application, library, etc. should have a page like this.

A better approach would be to use warnings.

Agree with you AND the op. I think it's really nice to receive warnings but we also need a corresponding - don't do this, but do this instead. Developers want to follow best practices...we just need to make it easier across the board to figure out what the better way to do that thing is. I like Airbnb's linting guide which shows don't and do examples for each rule.

Warnings (necessary and useful for quicker feedback loops) only tell you not to do something.

I'm surprised there's nothing in there about JSON; their page on json/jsonb [0] seems pretty clear that json has been all but superseded by jsonb.

When would json be preferable to jsonb? I can imagine cases where write-speed is essential and fancier queries and indices aren't needed, but that seems rare.

[0] https://www.postgresql.org/docs/current/datatype-json.html

Duplicate key names are supported in 'json', with formatting preserved. jsonb won't allow that, and it reformats the text.
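
You can see the same effect with any parse-and-rebuild step; here's an illustration using Python's json module rather than Postgres itself:

```python
import json

# A parse-and-rebuild round trip (which is what jsonb does on input)
# collapses duplicate keys and discards the original formatting.
raw = '{"a": 1, "a": 2,   "b": 3}'
parsed = json.loads(raw)           # duplicate keys collapse to the last one
assert parsed == {"a": 2, "b": 3}
assert json.dumps(parsed) != raw   # original whitespace is gone
```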

If you just need to store JSON, it is better to use json rather than jsonb because it will be compressed just like text. If you need to query it, jsonb is better. I have several use cases where I just need to store JSON.

We recently migrated a large jsonb column to json because, iirc, both the write and the read speeds were much better when there was no (de)serialization involved.

My current understanding is that json is faster and bigger than jsonb.

> My current understanding is that json is faster and bigger than jsonb.

FYI, it's the other way around. When you use the json type, Postgres validates it and stores the raw string. Operations on json are very slow. Whenever you extract a field from a json column, Postgres has to parse the full json.

On the other hand, jsonb is stored in a custom binary format. This makes operations a lot faster as Postgres does not have to parse the full json. There's also some other advantages such as jsonb stores the keys in sorted order making it possible to do a binary search to find a specific key in a lookup.

There are two primary downsides of jsonb. First, it doesn't preserve whitespace. This can be a problem for some use cases. Second, since there is a lot of metadata in jsonb, jsonb is larger. Postgres stores information like the offsets of every key and value.

Good point. We don't extract individual fields or query on them. We just schlep the data in and out, and for that, json appears to be faster.

In that case, couldn't you just use text?

If you need to preserve formatting, maybe.

It's a wiki, right? So you can simply add it.

So what would be wrong with the following announcement:

"In three years, varchar no longer takes an argument, char and money no longer exist, psql's password flag will be removed, all tables will be lowercased, BETWEEN will require fully qualified values, table inheritance will be disabled unless you specify a config value, and this page will be shortened to only include NOT IN and SQL_ASCII. In one year, a lot of these things will raise warnings. In two years, they will raise suppressible errors."

(I left out a few for brevity, but only a few -- this would actually resolve most of this page.)

Most of this strikes me as common things (varchar and psql --password in particular), and aside from NOT IN there appear to be easy solutions to make people stop using them without big changes. It's not like python3 where you have to change a core thing like text handling: it's removing a flag from your cronjobs (--password), omitting a value (varchar(n)), or telling the database what you really meant (BETWEEN X AND '2019-01-01' -> '2019-01-01T23:59:59.999'). For stuff without obvious fixes, a "do not do this" page is much more reasonable.

I'm sure that there are far too many use cases of varchar(255) out there for anybody to deprecate such a feature without causing huge pain to everybody involved.

Just look at the python 2 to python 3 transition, which is still ongoing, more than 10 years after the python 3.0 release.

Also, warnings usually simply don't work. Developers or admins google the warning, see that it can be switched off, and switch it off. DDL is often generated (for example for migrations), and you might not want to add a special case for postgres if you can avoid it. Suppressing a warning is definitely easier than changing code generation in potentially incompatible ways.

Plus, I haven't looked it up, but given how prevalent VARCHAR(N) is, it might be part of the SQL standard.

What's wrong with it is that this is a database and not a JavaScript framework flavor of the week.

Is it safe to assume most/all of these best practices are advisable across all flavors of SQL? I can imagine it getting overwhelming when interviewing for jobs that nitpick specific code using specific implementations of SQL.

No, many of these are specific to Postgres. I don't know of any other database that has Postgres's rewrite system called Rules. Also, other databases handle mixed-case identifiers differently. PostgreSQL is one of the databases that adheres to the SQL standard most closely, which makes you think that many of these things would apply to other databases. But that's just it: databases follow the standard willy-nilly, so that may be a reason that Postgres is in fact different from them.

And yet, yes! The best practice, obviously, is to always use Postgres. So: what other flavors of SQL?

More seriously, I have read that SQLite is often similar to Postgres. Its main author has actually said in a talk that when making decisions he asks, "What would Postgres do?" Also I have heard that Postgres is similar to Oracle, so much so that you could move from Oracle to Postgres with a thin layer of compatibility. In fact I think that's what https://www.enterprisedb.com/ is.

> PostgreSQL is one of the databases that adheres to the SQL standard most closely...

Minor nit: Pg doesn't have SQL-standard stored procedures, a.k.a. SQL/PSM. MySQL and DB2 do, however.

That is the command to create one, but the syntax of the definition body is what does not comply with SQL/PSM.

Looks like there is an extension, but it hasn't been updated in 8 years and doesn't appear to be production-ready yet [0]

[0] https://github.com/okbob/plpsm0

The performance characteristics of `char` vs. `varchar` vs. `text` certainly depends on the implementation details of each RDBMS. PostgreSQL recommends that you use `text` for basically every textual column, but other databases can be very different. Some won't even allow you to add an index to a `text` column.

The various timestamp-related data types are also very PostgreSQL-specific.

As a SQL Server developer, a lot of the advice here applies to SQL Server as well (or at least advice that isn't specific to PG tools, such as "Don't use psql -W"). However, the biggest issue with applying these practices to SQL Server is that terms that seem similar may in fact be radically different. For example, the `timestamp` data type in SQL Server isn't at all what you'd think, and `text` is an old deprecated type with poor performance.

So I think there's some good high-level advice here that applies across multiple flavors, but the specifics are really only applicable to PostgreSQL.

Oracle treats NULLs very differently than most (all?) other RDBMSs. So, for example, the Postgres best practice about NOT IN doesn't apply to Oracle.

Not necessarily. Different RDBMSs have different implementations so what isn’t bad in one might be comparatively bad in another.

> Is it safe to assume most/all of these best practices are advisable across all flavors of SQL?

No, anything like this is pretty implementation specific.

In "Table Inheritance" the article is referring to a postgres specific feature which implements a form of the data pattern by the same name. The pattern is not unique to circumstances where you are working with an ORM. So if your data is conveniently modelled using such a pattern by all means use it, in whatever DBMS you are using, but perhaps not with this postgres specific feature.

In "Not In" the fragment "does not optimize very well" is dependent on the query planner. Other DBMS engines may or may not optimise this well, so you might want to run performance tests on realistic data to check. The possible confusion when the list contains one or more NULLs is present in any standard-conforming database, because that is a property of NULL handling.

The comments about case are implementation-specific too. In SQL Server, for instance, it depends upon the collation set for the database, with names being consistently case-sensitive or consistently not (usually the latter) depending on the collation setting and irrespective of any name escaping.

The point about using BETWEEN with date+time values is correct generally, and I have seen it cause confusion a number of times. Though I've seen similar confusion with other date-based comparisons on date+time types too, so this isn't specific to BETWEEN.
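
The usual fix, shown here in Python rather than SQL, is a half-open range, which avoids guessing the last representable instant of the day:

```python
from datetime import date, datetime

def on_day(ts: datetime, day: date) -> bool:
    # Half-open range [midnight, next midnight): includes 23:59:59.999999
    # but excludes the boundary instant, so rows are neither missed nor
    # double-counted -- unlike an inclusive BETWEEN-style upper bound.
    start = datetime(day.year, day.month, day.day)
    end = datetime.fromordinal(day.toordinal() + 1)
    return start <= ts < end
```

The SQL equivalent is `ts >= '2019-01-01' AND ts < '2019-01-02'` rather than BETWEEN with a hand-built end-of-day value.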

The comments about time types with timezone information are general to databases that support such types (though not all common DBs do). My advice would instead be to always store times in UTC and translate to the local timezone elsewhere (though this can have its own problems so others may disagree).

String data types and their performance characteristics vary a lot between databases. In SQL Server a VARCHAR(MAX) value (the equivalent of postgres' VARCHAR-with-no-length or TEXT) will usually be stored off-page, which can have negative performance implications, and can't be indexed, so you want to use limited-length text columns unless you definitely need long values. What SQL Server calls TEXT is essentially a BLOB type and long since deprecated anyway.

> I can imagine it getting overwhelming when interviewing for jobs that nitpick specific code using specific implementations of SQL

If the job is specifically working with that flavour of database then this sort of detail might come up, there is little-to-nothing you as an interviewee can do about that other than swot-up beforehand if you are used to a different engine. Though such detailed nitpicking probably doesn't belong in a good interview process anyway, at least not until late stages when there are too many otherwise excellent candidates and they need something/anything to differentiate between them to make the choice easier!

If you are interviewing for a job that involves coding for multiple SQL database flavours then you will need to at least appreciate that the characteristics and behaviours of data types vary between engines, even if you don't know all the specifics off the top of your head. Other oddities you might come across include Oracle storing empty strings as NULL (with the possible problems at retrieval time this implies), SQL server ignoring trailing spaces in strings most of the time, and so forth.

Despite the recommendation, timestamptz is not actually good for timezones. I think it converts to UTC but doesn't actually store the time zone!!!

That's fine for, e.g., recording a timestamp, but doesn't work for a scheduled event. What if you want to be able to create a new event for "same time next week"? Impossible without having the time zone.

I had an app where it mattered so we stored the datetime with a timestamp (as UTC) and had a separate column for timezone.

I don't really see the point of timestamptz. In most cases when you're just storing a moment in time, convert it on the frontend and store it as UTC. If the time zone actually matters (scheduled, future events) then timestamptz doesn't cut it, and you need to store the timezone separately anyway, so again might as well just use UTC timestamp again.

Eh, I guess if your programming language doesn't have good time zone support you could lean on postgres for the conversion to UTC, but in most apps you'll have to deal with time and zones on the frontend anyway.

I disagree. Perhaps I don’t understand your complaint fully.

Timestamptz column values are the UTC timestamp, but the returned timezone-aware result will depend on your db settings, or even the local time of your client. Which I think is what everyone would want.

Storing an extra column time zone with a timestamp is just an inconvenient way to implement timestamptz.

Agreed. Timestamptz is not good advice, and the database storing something different from what you send it is always a source of problems.

Standard timestamp datatype always stored in UTC or some specific timezone in another column is the best way to deal with dates.

re table inheritance, don't confuse SQL Table Inheritance with PostgreSQL Table Inheritance. They are different things. SQL Table Inheritance is a good thing, and should be used when needed, such as for the Party Model


For money amounts, use NUMERIC(19,4)

> For money amounts, use NUMERIC(19,4)

Note: if you ever find yourself dealing with cryptocurrencies, you'll probably want at least 18 decimal places (BTC and lots of coins have 8, and ETH has 18).

From the title, I thought they would list the one mistake I've seen junior sysadmins make: not changing the default values for resource allocation [1] in a fresh install.

Good to know that I don't need to calculate the size of my varchars, though.

[1] https://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Serv...

I actually disagree with "Don't use timestamp (without time zone)". The Java class library (and before it, Joda time) distinguishes "instants" and "locals" which is pretty much the same as PostgreSQL's "with time zone" and "without time zone". They are just different data types with different use cases, and saying that one is preferred is misleading.

I like to compare them to bytes and characters. You would use bytes if you need bytes, and you would use characters if you need characters. They are unrelated data types unless a character encoding is specified. Likewise, instants and locals are unrelated data types unless a time zone is specified.
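
Python's datetime draws the same line: naive values ("locals") and aware values ("instants") are deliberately incomparable, because no conversion exists without a time zone:

```python
from datetime import datetime, timezone

local = datetime(2019, 5, 3, 17, 51)                         # no zone: a "local"
instant = datetime(2019, 5, 3, 17, 51, tzinfo=timezone.utc)  # an "instant"

try:
    local < instant
except TypeError:
    # Python refuses to compare offset-naive and offset-aware datetimes.
    print("incomparable without a time zone")
```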

BTW, the manual page links to https://it.toolbox.com/blogs/josh-berkus/zone-of-misundersta... which explains that "with time zone" doesn't store a time zone.

I can't help but disagree: there is no such thing as a time without a timezone. It is just an undefined time. The "local" timezone could be anything and could change at any time. Not recording it is just data loss.

Java Instant is a timestamp in UTC, but it's all too easy to make mistakes in serialization as the timezone information is just not present. I have spent a few hours fixing exactly that yesterday in my day job.

The page you link says that with timezone makes sure to convert to and from UTC, which is just a transparent optimizing implementation detail. If you don't do timezone conversions at DB level, you have to make sure your applications do, which is a recipe for disaster.

There absolutely is such a thing as a time without timezone. If I schedule an event for a time in the future, I want it to occur when a wall clock in that location reads the time that I specified, even if local government alters time zone rules for that location in the interim.

If I set an alarm on my phone for 6am, I want it to go off at 6am in whatever time zone I happen to be in, even if I travel after setting the alarm.

That's very much a time with a timezone (the timezone being the place where the event happens).

However it's not a postgres "timestamp with timezone", which (as both TFA and fphilipe's article note) doesn't store a timezone.

The problem of "future event in a local place" is why the "store everything as UTC" trend is wrong, incidentally: if the timezone's offset changes (which happens regularly, sometimes with very little warning) the stored time is now completely off, by minutes to days[0] depending on the change. "Store everything as UTC" is only suitable if you're only ever storing past events (timezones don't normally get retroactively altered).

[0] https://www.timeanddate.com/news/time/samoa-dateline.html

But you are tying it to a timezone: the timezone that will be valid at the place and date that the meeting is at.

Basically, a timestamp should be tied to a timezone or a location, or at worst an indirection (like you being at a specific location). And since we don't tend to have practical support for tying timestamps to locations, we tie them to timezones instead.

I've written a post about timestamp vs timestamptz and the only valid reason for timestamp I could come up with is exactly this, scheduling events in the future: https://phili.pe/posts/timestamps-and-time-zones-in-postgres...

I have implemented an event ticketing system. One of the things I (fortunately) realized early was that storing times with timezones would be a problem: events get moved, occasionally across timezone boundaries. If you store epoch instants, then the logic for "set location" would also have to update every timestamp in the database associated with the event.

IMO, in a database you should store the "natural" representation of date-time as your users think of it. Sometimes (like your alarm clock) it's a timezone-less concept. Sometimes (like a sporting event) it's a combination of datetime + location. Sometimes (like historical events) it's a fixed instant in time. Recognize which is which and model appropriately.

I think what the parent tried to say is that there's no such thing as a timeSTAMP without a timezone. There is, like you said, a time without a timezone, but it does not contain a date.

I am not saying it is the best way to handle things, but for cases where physical locations are well defined they make sense.

I can imagine Walmart or Ikea saying "We open at 8 am worldwide".

Your example is a time without a date, which of course does not need a timezone.

Any use of the word time above actually meant datetime.

"We open at 4AM on Black Friday this year"

There, better?

A date doesn't really change anything. Someone could say, "We open on October 1st, 8 am local time worldwide".

The question is what do you do if you really don't know the original timezone because it wasn't recorded? You don't want to guess at data that wasn't there.

Obviously a new system shouldn't do this, but old data is messy.

The big problem with TIMESTAMP WITHOUT TIME ZONE is that if you do try to insert a datetime with a timezone included, the timezone is simply silently ignored.

The confusing thing about TIMESTAMP WITH TIME ZONE is that it doesn't actually store a timezone! It simply converts input with timezone info to a global instant of time.

Since, when using a database, I want the same data out that I put in, I would default to using WITHOUT TIME ZONE, as the database isn't storing a timezone, it's converting. I want that conversion deferred until the last moment, until whatever locale-aware software facing the user gets the data and does the right thing THERE.
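Both behaviors described here are easy to demonstrate; assuming a session whose "timezone" parameter is set to UTC, the two casts act like this:

```sql
-- cast to timestamp: the +02 offset is silently dropped
SELECT '2019-05-03 12:00:00+02'::timestamp;
--> 2019-05-03 12:00:00

-- cast to timestamptz: converted to UTC; the original zone itself is gone
SELECT '2019-05-03 12:00:00+02'::timestamptz;
--> 2019-05-03 10:00:00+00
```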

It's important to note that "with time zone" doesn't actually store the time zone, but converts to UTC.

If you actually need the original time zone preserved, you need to store it separately.

This is the real thing

I've written about this topic before. Maybe it convinces you that you should (almost) always use timestamptz.


If used as a single source of truth, the DBMS should store true propositions about the world, and that generally means that events are recorded as happening in the correct time zone.

Also, datetimes that predict forward or recall back far enough may be a case where a timezone is incorrect. The deciding factor is whether you're able to make a true statement.

If you're using a DBMS as a dumb data container, and there are plenty of good reasons to do so, then it may also make sense not to use timezones. If you're doing intermediate calculations or sharing with systems that don't support timezones, those values often shouldn't have timezones. That's also why all datetime libraries support times sans timezone.

Using text with check constraints is a good one.

Setting up the constraints is a tiny bit more work, but it's worth it in the end.

It's especially nice when your ORM or web framework knows to deal with all of this at both the migration and validation levels. Then you can have both database level protection and nice human readable errors in your app.
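The text-plus-CHECK pattern looks roughly like this (table, column, and constraint names are made up; note that re-adding a CHECK constraint scans the table to validate existing rows, though `NOT VALID` can defer that):

```sql
-- 'text' plus a named CHECK constraint instead of varchar(n): the limit
-- lives in one constraint you can drop and re-add, not in the column type
CREATE TABLE app_user (
  id    bigserial PRIMARY KEY,
  email text NOT NULL
        CONSTRAINT email_len CHECK (char_length(email) <= 254)
);

-- raising the limit later doesn't rewrite the table:
ALTER TABLE app_user DROP CONSTRAINT email_len;
ALTER TABLE app_user ADD CONSTRAINT email_len
      CHECK (char_length(email) <= 320);
```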

Some of the more surprising ones (to me):

1) Don't use timestamp (without time zone)

2) Don't use varchar(n) by default - just use varchar or text without any length specified

3) Don't use money - The money data type isn't actually very good for storing monetary values. Numeric, or (rarely) integer may be better.

> 1) Don't use timestamp (without time zone)

The problem with this one is both timestamp and timestamptz are problematic:

* timestamp will implicitly discard input timezone information

* timestamptz doesn't store a timezone; it converts everything to UTC and stores that. Furthermore, it assumes any input without timezone information is in the zone given by the "timezone" system parameter, and converts from that on input and to that on output.

Storing everything as UTC is problematic because it's completely unreliable for all future events e.g. it's plain broken for calendaring and reminders.

The money one is a bit worrying, at least if "numerical" includes floating point types. It's almost never a good idea to use these for monetary values.

If you are going to have different currencies I almost always would go for integer types (I don't understand the "rarely" advice here) as the fraction may vary greatly for different currencies. It's better to derive the fractional from the currency.

`numeric` is a decimal type with an optional fixed-precision mode, so it works fine with decimal currencies.
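A sketch of the two options discussed, numeric with a fixed scale versus integer minor units with a currency code (table names invented):

```sql
-- numeric with an explicit scale: exact decimal, no float rounding;
-- fine when everything is in one decimal currency
CREATE TABLE invoice (
  id     bigserial PRIMARY KEY,
  amount numeric(12, 2) NOT NULL
);

-- integer minor units plus an ISO 4217 code: the currency determines
-- the fraction (cents, pence, yen, ...), so mixed currencies work
CREATE TABLE payment (
  id           bigserial PRIMARY KEY,
  amount_minor bigint NOT NULL,
  currency     char(3) NOT NULL
);
```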

Anything similar for MySQL?

Yes, don't use MySQL, use Postgres.


It's a well established fact that PostgreSQL is superior to MySQL in just about every category that appears in any comparison. Sometimes MySQL is just plain broken, e.g. that schema changes cannot be rolled back in a transaction.

One would be: Never use the "utf8" character set, only "utf8mb4". "utf8" is limited to characters that fit in three bytes, and everything else will be truncated (or rejected in strict mode). "utf8mb4" is what you actually want when you think of UTF-8.
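For example (MySQL; the table name is invented, and the ALTER rewrites the whole table, so test it on a copy first):

```sql
-- create new tables with the real UTF-8 charset up front ...
CREATE TABLE comment (
  id   BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  body TEXT
) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

-- ... and convert existing ones that were created with "utf8":
ALTER TABLE comment
  CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
```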

Don't use UNIX_TIMESTAMP() or FROM_UNIXTIME(). Both overflow in 2038.

I hoped you were kidding me, but no:

    $ perl -wE 'say scalar localtime 2177449200'
    Sat Jan  1 00:00:00 2039
    > select FROM_UNIXTIME(2177449200);
    | FROM_UNIXTIME(2177449200) |
    | NULL                      |
    1 row in set (0.000 sec)
I work on a code base where the word UNIX_TIMESTAMP occurs 400 times :(

I wanted to come back and ask the same thing, only to guess there would be some snarky replies. Sure enough, there are, but sometimes we have to make the best of what we do have. I can't switch to PostgreSQL because some posters on Hacker News think or say MySQL is terrible or shouldn't be used for technical reasons. There are many highly successful profitable companies, small businesses, and start ups that have built their backend using MySQL. I'm not saying MySQL is better either.

To the topic of the post, if anyone knows about succinct list similar to this for MySQL that would be great. I tried to search for something after reading this, but wasn't having much success.

Don't use --password for the CLI

When to Use MySQL?


Any tool available to check a schema or a set of sql scripts for those common mistakes?

Flashback to the time I encountered a project where they'd been storing text in bytea columns 'for performance'.

To make matters worse, there were no indexes on anything other than on primary keys.

It looks like the only safe way to use the Rule system is if you replace the query with exactly one new query. Basically what a view does.

They should really deprecate all other forms. I’ve wasted time reading up on this feature several times, always to conclude there is a better solution to my problem.

Maybe I'm thinking of this wrong, but this is one of the strong suits of NoSQL IMO. Especially when you're starting to build a new service, where you're not sure what exactly your data structure is until the whole system comes together, it's really easy to iterate and add new columns, change existing ones, etc. As long as your code treats each item as its own entity (rather than a row in a predefined table), you can continue to try new formats until you eventually lock in something that does everything you need.

SQL is much less forgiving. Database migrations are terrifying, but the longer you put them off the longer migrating will take (since new rows will continue to accumulate).

Perhaps the solution is "plan out your data structures ahead of time", but given that we're human, I think it's really difficult to do so correctly on the first try. For example, you may have an int column to keep track of the number of "Likes" an item on your service gets, but it may not be until later that you realize you also need to keep track of who liked what in order to prevent users from liking something twice. This is an elementary example, but the point is that when you design tables, especially if you design them around the frontend, you end up leaving out some critical columns that the backend needs to implement its features. Whenever you end up discovering this, you have to not only add that column to a new table but also retroactively insert every existing row into the new table. At best this is O(n) but if you're using a database there's a good chance your n will be quite larger.
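The "who liked what" example maps naturally to a join table whose primary key rules out double-liking at the database level; a minimal sketch (names invented):

```sql
-- one row per (user, item) pair; the composite primary key makes
-- liking the same item twice a constraint violation, not app logic
CREATE TABLE item_like (
  user_id  bigint NOT NULL,
  item_id  bigint NOT NULL,
  liked_at timestamptz NOT NULL DEFAULT now(),
  PRIMARY KEY (user_id, item_id)
);

-- the like count then comes from an aggregate rather than an int column:
SELECT item_id, count(*) FROM item_like GROUP BY item_id;
```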

You could argue that maybe in the startup stage your n will still be small, since you won't have a billion users off the bat, but I would counter that you usually don't realize what you need to change until someone else breaks something (assuming you have tests for your code already) or requests something new, at which point you likely have an established user pool.

Database migrations are not hard when you have hardly any data, i.e. while you are still building something, and they aren't that hard unless you have enormous amounts. On the other hand, having data where records have different attributes is annoying to handle everywhere at runtime, especially when the difference exists only because of something you changed and new records no longer look that way. You just end up with code littered with special handling for the old design.

This is, IMHO, not a good reason to use NoSQL.

When you want to change how your data is stored you have two options:

1) Don't touch the old data, and make sure all your code knows how to handle both the new and old layouts

2) Update the old data to comply with the new layout

#1 is cheaper in the moment, but makes your software more fragile because of the special cases: you now have two codepaths to check for every change instead of one. Also, whatever improvements the new layout is supposed to provide aren't available to the old records.

#2 is more expensive right now, and scarier because your migration will be touching everything at once. If it starts trashing data, it'll trash all of it before you have a chance to stop it. On the other hand, after you've successfully migrated, all your data is consistent and you don't need to forever deal with legacy records in the active codebase.

Both approaches can be used with either SQL or NoSQL datastores, and there's no reason you have to pick the same strategy for every schema change. There's pros and cons to both approaches and which one is better will depend on your particular situation.
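In SQL, option #2 is often done as an online backfill rather than one giant UPDATE; a rough sketch (table and column names invented; on Postgres the final NOT NULL step takes a full-table scan):

```sql
-- 1) add the column as nullable: a fast metadata-only change
ALTER TABLE orders ADD COLUMN status text;

-- 2) backfill in bounded batches to keep locks and WAL volume small;
--    repeat over successive id ranges until no NULLs remain
UPDATE orders SET status = 'completed'
WHERE status IS NULL AND id BETWEEN 1 AND 100000;

-- 3) once the backfill is complete, tighten the schema
ALTER TABLE orders ALTER COLUMN status SET NOT NULL;
```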

Anyone know which of these are true for Redshift as well?

It seems like there’s a lot of things you should never do. Why don’t they just remove said things and make everyone’s life simpler?

Redshift reads ‘text’ as varchar(256). Calling all AWS Redshift devs, can we get this on the roadmap?


I work at a large company, I’ve often seen SQL using this idiom, even in code written on DE teams.

Man, I just started using pgsql and I fall foul of at least 3 of these.

I am going to give pgsql a try soon and I'm not surprised. The password flag is something I would totally have set. CamelCasing tables is something I would do. BETWEEN and table inheritance were recommended in a post last week, so I would totally have tried these out too.

A lot of this stuff just needs to be deprecated (case-sensitive tables, password flag...) or come with (bigger?) warning labels (between query, table inheritance...).

Damn, I wanted to try this table inheritance thing some day.

am I the only person bothered by table names being converted to lowercase and having to painfully quote the table names that are capitalized?

This is one of the main things I hate in PgSQL... Arbitrarily converting identifiers in any way is a very bad practice in my opinion. I should be the one to decide how I want to design my database and namings, not some arbitrary rule. I especially want my database and my code to be consistent, and snake_case is far from being the most used convention...

You sure? All the projects I've worked on have used snake case. I've only seen newcomers to programming use capitalized table names.

I'm working with data where the columns in another DB are already CamelCase and I want to maintain consistency. I'm not a newcomer- I've been programming for 30+ years.

I think parent meant in programming languages generally.

Agreed. I work on databases where column names are capitalized throughout our source code, and it’s annoying when sql like “SELECT FirstName FROM ...” gets output as “firstname”.

This is one case where MySQL seems a bit better.

Note if I could I’d change to lowercase underscore naming conventions, but it’s impossible to change now.

It works the same way in Oracle - i.e., it is case-insensitive. Which RDBMSs are case-sensitive in terms of things like table/column names?

Case-insensitive identifiers are SQL-standard. I think their issue is that pg lowercases identifiers before creation/storage. That is,

    create table Foo ( Bar serial );
is equivalent to

    create table "foo" ( "bar" serial );
but not to

    create table "Foo" ( "Bar" serial );
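Assuming both the unquoted `create table Foo` and the quoted `create table "Foo"` variants above have been run (so tables foo and Foo both exist), the folding rules play out like this:

```sql
SELECT Bar FROM Foo;       -- unquoted names fold to lowercase: reads table foo
SELECT "Bar" FROM "Foo";   -- quoted names are case-sensitive: reads table Foo
SELECT bar FROM "Foo";     -- ERROR: column "bar" does not exist in table Foo
```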

I like this DDT list... Don't Do This wikipage

Anything similar for oracle?

It is 2019 and this site is not mobile friendly!

#1. Using it

Title should be "A list of things not to do when using Postgres" or something like that. "Don't do this" without elaboration is the very paragon of click-bait.

As the domain is displayed and is postgresql.org doesn't that rather hint at what the content might be?

The domain is not displayed in the RSS feed (which, IMO, is a frustrating oversight).

Not when using a feed reader. Yes, RSS is still a thing.

Doesn't your feed reader show the domain?

No, not on a phone. Using Feedly, which is quite a popular reader after Google Reader died. On phone you need to go into the comments before seeing the HN page which has the link in it.

> Don't Do This (postgresql.org)

I think the average person can figure it out.

Each of these could be hosted on postgresql.org:

> Don't Do This: Mistakes we've made in licensing

> Don't Do This: A case for considering MongoDB

> Don't Do This: Handbook for social behavior at events & conferences

I suggest “Don’t do this — A list of common mistakes when using PostgreSQL”

Considering that the title on the webpage starts with it and the url ends with it, I don't see the issue in this case

“Things we should have deprecated, but didn’t”

As a novice in the backend world (I'm a front-end guy), it makes it odd to read stuff like this when learning a technology for the first time. I know a lot of the reasoning is "legacy" as to why things won't have been deprecated but it does add to the confusion when trying to learn not just what do I need to do to perform some task, but why am I doing it this way etc.

legacy code and backwards compatibility are dumb excuses to not cleanup the codebase and make bad practices like those listed impossible.

there's always deprecation notices, deprecation flags and "not giving a shit that somebody's 20 year old codebase breaks on today's version of $project".

That attitude has killed countless software projects. Conversely, many projects only exist because they value backwards compatibility above almost all else.

"Bad practices" often don't start out as bad practices; circumstances change. Additionally, some "bad" things are more like "dangerous if improperly handled" things (e.g. Rules), so making them impossible cripples users.

Lastly, there exist some useful, maintained libraries/tools with a substantial proportion (or majority) of "client" codebases at or greater than 20 years of age. Just because something has been around for a long time is not an excuse to break it. Just because something has not been maintained does not mean it is necessarily poor engineering or deserves to suffer for the passage of time.

sorry, i don't buy it. it's not a matter of attitude, it's just common sense and sanity. if i maintain a library X and i recognize that some behavior is producing dangerous/harmful results - no sane person should ever complain about me deprecating something i consider dangerous/harmful.

either don't upgrade or fork away, don't make others suffer for your unwillingness to change.

the Rules example is mentioned often - i suggest not to take everything written on the internet so literally, or is it really impossible to convey sarcasm without /s?

a better example commenters could focus on is the type `money` - if best advice from postgres about handling money is to not use `money` type - it shouldn't be part of postgres. how is that in any way controversial?

This is controversial because people run businesses on 20-year old software, and they don't want to run their business into the ground every time some outside developer wants to clean up their code base. People's livelihood is at stake. They prefer software that remains solvent. Natural selection has produced a world where people value backwards compatibility.

And you should always be able to upgrade the software you have without fear of breakage, because maintaining a fork and/or deciding which versions to use is often untenable as it relies on specialized knowledge which may not exist in your organization. If you as a software dev want to maintain a clean code base with no deprecated features, you are free to build one. That's not what this is, and you shouldn't expect anyone else to use it if it were.

"sorry, i don't buy it. it's not a matter of attitude, it's just common sense and sanity."

It's not 'common sense' or 'sanity' to break existing code bases. Companies and individuals have generally spent millions on their software, and breaking changes cost time and money. There's a reason that Python 3, for example, took over a decade to be fully adopted. Breaking changes should only be implemented with considerable evaluation of the actual need and of existing options. If users cannot rely on you to not break their systems, they will move on to someone they can rely on.

"either don't upgrade or fork away, don't make others suffer for your unwillingness to change."

I manage a number of databases, mostly of a financial nature, and I can tell you that if I were using a database system developed the way you suggest I'd be in trouble. On the one hand, I have to keep my data secure, so I have to patch my systems on a regular basis, which means I can't use a version of a database system that isn't being actively updated. On the other hand, I have to manage hundreds of thousands of lines of mission critical code, most of which I didn't write, and a large amount of data which absolutely, under no circumstances, can be lost, if I want to keep my job. To that end, I can't risk breaking changes and usually wouldn't have the bandwidth to accommodate them if I could. There's a reason that many big companies are still using mainframes despite there being better options: rule number one as a system administrator is to not break the existing system.

"a better example commenters could focus on is the type `money` - if best advice from postgres about handling money is to not use `money` type - it shouldn't be part of postgres. how is that in any way controversial?"

Let's actually take this example seriously. Let's say that there is an off-the-shelf accounting system that uses PostgreSQL and, because of bad decisions made in the past, uses the 'money' datatype internally. Presumably they've worked around its limitations so it's no longer an issue for them. Let's say that PostgreSQL decides to remove this datatype in the next release of the database. Well, suddenly not only does that accounting system no longer work, but every company that uses the software is SOL. In addition, any integrations and custom code that worked with that accounting system are now broken. This is a loss of millions of dollars in software dev time at the very least. There's also a possibility of data loss, which when talking about accounting systems is a matter of legal compliance. This is a much worse outcome than people accidentally using 'money' in new development in the future. Considering that the latter problem can be ameliorated with just a warning, that seems like the more obvious trade-off.

> suddenly not only does that accounting system no longer work


have i suggested anywhere to suddenly make releases with features removed? does everybody in this thread have a mental blocker on the word "deprecation" or something?

there are ways to remove features properly and gradually. deprecation warnings, hiding the feature behind deprecation flag, moving the feature out into a plugin, etc.

i honestly wasn't expecting to have to respond to such strawman comments.

"have i suggested anywhere to suddenly make releases with features removed?"

Yes you clearly did:

"legacy code and backwards compatibility are dumb excuses to not cleanup the codebase and make bad practices like those listed impossible."

"...it shouldn't be part of postgres..."

""not giving a shit that somebody's 20 year old codebase breaks on today's version of $project"."


Just because you also listed deprecation doesn't mean that you didn't also list removing features and not caring about the existing user base. You didn't clarify which was the better option so it's hard for me to parse this as anything but an endorsement of suddenly removing features and I suspect that's what other people are seeing as well. If you have any doubt about this, look at your comments. You have one comment only mentioning deprecation and that one has no downvotes but the later ones which mention removing features do. If you didn't mean to say that, I would suggest editing the comment or at least clarifying what you meant.

"Things that should not be in the SQL Standard, but are." FTFY

Not true, Rules are very powerful, especially for refactoring tables with zero downtime.

discussion on lobsters where I posted it recently: https://lobste.rs/s/m4f2ke/don_t_do_this

Clearly the guy who set up the website I'm refactoring didn't read this.

Oddly, no mention of "using MySql". ;-)

what do you mean?

That it'd be a mistake to not use PostgreSQL.

The biggest of these mistakes, in my experience, is mentioning in a public forum that you use MySQL.

(Disclaimer: I picked between the two at random and only continue to use the thing I picked because I now know it well and it is adequate for my needs)
