Seriously? Every time I create a Django CharField I try to guess in my head what the best length for the field is to optimize for performance, but it sounds like using an unlimited-length field (TextField) doesn't make any difference. Good news.
It's a run-time error too, so the index may get created no problem, and then later someone tries to insert a >900-byte value and it throws.
Looking at the doc, I see this:
> PostgreSQL uses a fixed page size (commonly 8 kB), and does not allow tuples to span multiple pages. Therefore, it is not possible to store very large field values directly. To overcome this limitation, large field values are compressed and/or broken up into multiple physical rows. This happens transparently to the user, with only small impact on most of the backend code.
So, yeah, large values are stored differently and there's an impact on performance.
Sorry, a blanket statement like "you should use varchar with no length limit" seems extremely ill-advised to me. Like, sure, you don't want to use varchar(30) for last names. But varchar(255) is probably going to do you just fine.
That said, I still don't get why they didn't just change the behavior for TEXT/NTEXT vs adding (MAX) options.
edit: I am mistaken, when setting an index, the index max is now the insert limit (from response below).
(off the top of my head, the most likely use case for 2+kb text fields would be things you would be using pg's full text index on anyway, not the usual btree type indexes as you'd want to speed up a join or whatever.)
I am not that familiar with PG full-text index types, but in SQL Server the full-text index doesn't really work well with exact/begins-with searches the way the b-tree does, so you're kind of stuck in that situation.
With SQL Server I like to stick with varchar(255), as it gives headroom for Unicode indexing, and at one point there were some internal optimizations for lengths <= 255 due to the legacy max being 255; I'm not sure anymore.
Beyond 255 it's much less likely you'll need a b-tree, and it becomes more of a full-text issue.
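For reference, here's a minimal sketch of what the Postgres-side full-text option looks like (table and column names are invented):

    -- GIN index over a tsvector expression
    CREATE INDEX articles_body_fts
        ON articles USING gin (to_tsvector('english', body));

    -- matches rows whose body contains both lexemes
    SELECT * FROM articles
    WHERE to_tsvector('english', body) @@ to_tsquery('english', 'postgres & index');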
To the second part: that way lies madness. It's true that you don't know all your future use cases, but it isn't a good reason to leave things unrestricted. It is much easier to increase the limit than to lower it (because you are likely to be deleting data if you lower a limit, which is a tough pill to swallow).
For users of older postgres deployments (<9.2) it's worth noting that changing the limit on varchar is not free: it triggers a table rewrite.
Of course, you mention the stored procedure layer that is required to make this work, so you're probably already aware of this. In that case, remember that you might not need views over literally every table in the database, and even in the case where you do use a view or stored procedure, it might still be much more convenient to do the length limit check on the table rather than cluttering every stored procedure that writes to the table with the same length limit check.
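For concreteness, the statement in question would look something like this (hypothetical table and column); on 9.2+ widening the limit is a metadata-only change, while older versions rewrote the whole table:

    ALTER TABLE customers ALTER COLUMN last_name TYPE varchar(500);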
PG is going to be the engine of choice for my next project, so I'm happy to learn about this quirk compared to other engines. Which doesn't mean I'm all that happy about it. I'm really in agreement with this post from 2010 (still current AFAICT) that wishes the support for varchar(n) were better, for readability and cross-engine compatibility if nothing else.
It's the same reason that SQL Server has datetime2 and Oracle has varchar2 (the latter of which is only now being merged into varchar). That's why float, real, and money still exist even though numeric/decimal was created. They behave differently and in ways that are not exactly backwards compatible. In order to prevent your customers from having to rewrite their application just because you updated your data types, they chose to add new data types.
In this case, not needing to specify the length on your varchars is an implementation detail of Postgres, but the ORM still has to deal with weaker components.
Also, for those interested: there's no difference between varchar and text in Postgres. I just use text fields now, since all length checking is done at the application level, and Postgres is the only DB I need to support.
Personally, I prefer the latter. But it seems like in the ORM space the latter approach is nowhere to be found.
Have you ever seen an ORM that allows you to fluently deal with errors raised by triggers? DB-side computed columns [by storing to a table but querying from a related view]? Where the ORM’s “objects” expose an OOP interface to calling DB-side stored procedures that take that row-type as an input? (Or, even more crazily, where you could define a stored procedure in an abstract fluent AST against the object, and have it registered with the database? Or how about exposing DB-side BLOB handles as native-to-your-runtime IO streams, that could be fed into any runtime function that expects such?)
There’s really a lot you could do, that’s still abstract to the SQL level (rather than specific to a particular database) if you were okay with some implementations of your interface not supporting every feature.
Usually this means that I have a set of proxy types in the business layer that directly represent the DB types—table row types, table composite-primary-key types, view row types, stored procedure input-tuple and output-tuple types, etc.
Most ORMs support this sort of mirrored-schema modelling between the DB and business layer, with the ability to declare things like reference relationships where the ORM will throw an error if you get back a type you weren't expecting.
But when you drop to "raw SQL" in pretty much any ORM, you lose the ability to specify the business-layer-side input and output types for your expressions, and instead have to deal with sending and receiving the ORM's AST-like abstraction (usually plain arrays of scalars to map to bindings in the SQL statement.)
ORMs that provide a LINQ-like abstraction are a bit different, in that you can use SQL fragments within the fluent syntax to do things like call a stored procedure; and in those cases, they usually allow you to pass in typed arguments, and to specify the output type of the expression (for the rest of the fluent-syntax chain to use in interpreting it.)
While LINQ-like features are pretty nice, they still don't cover everything, because ORMs pretty much always execute their expressions within a transaction, and some SQL features (DDL statements specifically) can't be executed in a transaction. There are also certain types of SQL statements that most RDBMSes refuse to prepare with bindings—e.g. you can't make "which table you're querying from" a dynamic part of the query, even if you're willing to cast the result to a static "base" row type [as in PG's table inheritance.]
Look, I was being hyperbolic, but: SQLAlchemy is great for prototyping, and then it's too easy to just keep building on that crutch.
If I had just used the text datatype in the first place (along with a check constraint), these migrations would have been so much simpler.
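For illustration, a sketch of that approach (names invented): the limit lives in a named CHECK constraint, so raising it later is a cheap constraint swap rather than a column type change:

    CREATE TABLE posts (
        title text CONSTRAINT title_len CHECK (length(title) <= 100)
    );

    -- later, when 100 turns out to be too small:
    ALTER TABLE posts DROP CONSTRAINT title_len;
    ALTER TABLE posts ADD CONSTRAINT title_len CHECK (length(title) <= 200);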
I mean, most of this article reads as a list of features postgres never should've had. If PHP can deprecate magic quotes, then why can't postgres do something similar? Why do the "string data type" docs contain anything other than the "text" type, outside a section titled "deprecated, do not use"? Why isn't there a strict mode that warns me when I'm trying to use Bad Ideas From The Nineties?
Every tool that's powerful enough needs to have caveats. I agree it's a PITA, but this is orders of magnitude better than say, bash, C++, Linux, or PHP.
They can't deprecate varchar and some other ill advised field types because they are required to comply with the SQL standard.
The text thing, though, has never been a secret. Learning about it is sort of a coming of age ritual for postgres developers.
Even if somebody actually shows up with a longer name, you know that your application will have other problems in that case (e.g. the UI explodes or something), but you prevent the occasional bug or other problem from writing megabytes into the name field and causing follow-up problems.
Surely if it's variable-width, that means you can't guarantee it will be on the same page? And therefore there's a performance cost?
Consider a CHAR(1) field. Without unicode support this field is always 1 byte, and therefore easy to store. With unicode support, however, the value that fits in this field is actually between 1-4 bytes. However, in many cases users will only store 1 byte in this field. Storing this field as a fixed length 4-byte field will in many cases result in a 2-4X increase in size and hence waste a lot of space. Thus even with CHAR(1) it is usually better to store the strings in a variable-length manner to avoid wasting space; and CHAR(1) is the absolute best case for this type of optimisation. This gets way worse with larger CHAR lengths.
It's also worth noting that even if we could store a fixed one byte that that is likely still not the best way of storing a CHAR(1) field. A CHAR(1) field is typically used to store types or categories. Typically, a column like this would have few unique values (~2-8 values), which means we can reduce the size of these values to 1-4 bits using dictionary compression instead of a single byte, resulting in a size savings of factor 2-8X over storing a fixed one byte length.
The use cases for CHAR instead of VARCHAR that I'm thinking of would be things like serial numbers. CHAR(1) seems even more specific; the only time I've ever seen that used is for some version of enumerations with mnemonic keys.
In Postgres, if I had a truly massive number of alphabetic-textual serial numbers (such that I was worried about their impact on storage/memory size), I’d define a new TYPE, similar to the UUID type, where such serial numbers would be parsed into regular uint64s (BIGINTs) on input, and generated back to their textual serial-number representation on output.
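A minimal sketch of that idea, using a pair of SQL functions over BIGINT rather than a true custom TYPE (which would need C-level input/output functions); the 'SN-' format here is invented:

    -- parse 'SN-000000000123' into 123
    CREATE FUNCTION serial_encode(sn text) RETURNS bigint AS $$
        SELECT substring(sn from 4)::bigint
    $$ LANGUAGE sql IMMUTABLE;

    -- regenerate the textual form on output
    CREATE FUNCTION serial_decode(n bigint) RETURNS text AS $$
        SELECT 'SN-' || lpad(n::text, 12, '0')
    $$ LANGUAGE sql IMMUTABLE;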
char(n) doesn't reject values that are too short, it just silently pads them with spaces. So there's no actual benefit over using text with a constraint that checks for the exact length.
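A quick demonstration of both halves of that claim (the tables are hypothetical):

    CREATE TABLE codes (code char(3));
    INSERT INTO codes VALUES ('a');       -- accepted, silently stored as 'a  '

    CREATE TABLE codes2 (code text CHECK (length(code) = 3));
    INSERT INTO codes2 VALUES ('a');      -- ERROR: violates the check constraint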
So even if you're using pure ASCII, you're supposedly better off just setting the character encoding appropriately and using a text or non-limited varchar field.
That said, I thought the main benefit of a char field (and its padding) was that you could use it in select instances where it was useful to have fixed-length records. E.g. if all the other fields of the table are fixed length, every record can be fixed length if you use a char(n). I'm not sure whether that's a benefit to modern databases anymore, though.
Some RDBMS's don't store the string's length with each value for fixed-width character fields. So a fixed-length field may still be preferable if the space lost to padding amounts to less than the space lost to overhead.
That distinction doesn't apply to PostgreSQL, because it stores a length along with the value even for fixed-width columns.
Same with name and address fields. I've seen databases where they set it at a seemingly arbitrary number, because that's the maximum length their delivery service could handle (UPS etc.), but then they added another one with different constraints…
If you have a known fixed width you can be certain about your memory grants required for a specific query (width * rows estimate) - if you don't know that ahead of time you might find you need to allocate more memory than is actually required.
This can be really bad for concurrency.
• one “regular” PG table which has a row record type defined with a fixed length (like a filesystem inode), where all the static-length fields are held in the record itself, and any unbounded-length fields have a certain amount of space reserved for them in case they fit into that small space (if they’re even smaller, this space is just wasted);
• and one TOAST table—essentially a table of extents, that the unbounded-length fields of the previous record can, instead of holding data directly, hold pointers into.
TOAST for such columns only gets used when the column’s reserved size within its record gets overflowed; which I believe, for current PG versions, is 127 bytes. So, any row-type where all the fields are bounded to a length of at most 127 (either by literally using VARCHAR(127), or CHAR(127), or by using a TEXT column with a CHECK(length(col) < 128) constraint)—will never use TOAST, thus guaranteeing your stride width and memory bounds.
And, as well, any query that doesn’t try to “unwrap” the value of those unbounded-length columns, won’t have to join against the table’s TOAST table. So you can optimize for the memory bounds of specific queries on a table (by bounding the length of certain fields, keeping them out of TOAST) while leaving other fields as unbounded-length, if no hot-path query uses them.
(There’s also an extra optimization where values that are just barely too large to fit into the 127 bytes of reserved in-record space are transparently compressed, to see if that’ll make them fit. Thus, although you need the length check to guarantee that you won’t be using TOAST, it may turn out that your table “gets away with” having some values a bit larger than the bound without ever using TOAST.)
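Putting that together, a sketch of such a row type (the 127-byte figure is the parent comment's recollection, so treat the exact bound as an assumption):

    CREATE TABLE inode_like (
        id   bigint PRIMARY KEY,
        name text CHECK (length(name) <= 127),  -- bounded: stays in the main heap row
        body text                               -- unbounded: may spill to TOAST
    );

    -- queries that never touch "body" shouldn't need to detoast anything:
    SELECT id, name FROM inode_like WHERE id = 42;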
A question: if the planner knows it's joining to the TOAST table, and it has some predicate to evaluate, how does it know the widths of all the rows so it doesn't over-grant (or are memory grants not a thing?)
Generally with statistics/histograms you get a sample and a max width (or something like that) - so I am just confused how it can be precise.
So while you never suffer from "under"-resourced queries, you generally don't get to see what they would look like if they had more memory.
In the SQL Server world that memory grant is part of the query plan, so that's why its a question that occurred to me.
I've seen it be FAR wrong about memory, though, despite being right about column widths: When you join a few tables in a big select, the planner will look at some sample rows and guess the number of rows in each step of the possible plan, and if the sample is too small, that estimate can be wrong by a LARGE factor.
Pretty much, except if you're using ModelForms you'll end up rendering a different element by default, and the admin (if you use it) will be different. Not really a problem with DRF without the browsable API.
 - https://sqlperformance.com/2017/06/sql-plan/performance-myth...
SQL Server is probably like that because varchar(max) is relatively recent; I'd expect as their support for it improves that issue will go away.
That used to be a good idea. I guess it's fallen the way of defragging, or "parking" your hard drive.
One of the reasons given in the bug for it being difficult to fix is that down in the depths it assumes it is a varchar() -- with parentheses.
Here's the relevant comment:
If I'd read this first, I wouldn't have wasted Saturday finding out for myself why you shouldn't use it.
>Table inheritance was a part of a fad wherein the database was closely coupled to object-oriented code. It turned out that coupling things that closely didn't actually produce the desired results.
Honest question: if we should never use it, why have it in the first place as an option? Is it some type of backward-compatibility issue? This list is great, btw. Why we shouldn't do something is a great way to learn stuff.
> Why not?
> Even the manual tells you it's only implemented for SQL compliance.
> Don't use psql -W or psql --password
If I do a psql --help I see
-W, --password force password prompt (should happen automatically)
> This option is never essential, since psql will automatically prompt for a password if the server demands password authentication. However, psql will waste a connection attempt finding out that the server wants a password. In some cases it is worth typing -W to avoid the extra connection attempt.
So this is definitely confusing.
What they probably mean is that you shouldn't do something like --password=mypassword, which will work but is obviously dangerous to do.
If this happens you've lost already, -W or not.
Still, if you know that you're connecting to a server that's set up with password-based auth, then what would be the point of waiting? You'll inevitably have to provide a password at some point anyway, so the keylogger argument seems silly.
And if someone has the ability to install a keylogger on a machine that uses peer or certificate auth, then that someone almost certainly already has the ability to just connect directly using that same peer or certificate auth.
... which is what the original doc says.
There are a few countries where they no longer bother with small coins to represent .01, so cash transactions are rounded to the nearest .05 or similar, but .01 is still valid for electronic transactions. I expect that is the situation in Sweden. (I wish the US did this too.)
Electronic transactions, which account for over 50% of transactions in Canada, still use pennies though.
There's (sadly) no rule that everything must be a denomination of 0.05 SEK or even 1 SEK. Pretty much every shop continues to use cents in their pricing (like .50, .95, and .99). If you pay in cash you'll end up rounding to nearest SEK (rounding up if it's below 1 SEK). You could potentially save money by paying with cash if your transaction would be rounded down... but at a maximum of 0.49 SEK (~0.05 USD) saved per transaction you'd need some unique spending habits to actually utilize this "trick" efficiently.
Source: https://www.riksbank.se/en-gb/payments--cash/notes--coins/co... and I happen to be Swedish.
Normally I would want to enforce the use of UTC for points in time (as opposed to user entry) across backend code and APIs. If the column name is "eventtime"... then it is (of course) UTC...
I guess one could use timestamptz with a constraint saying that timezone is UTC, but is there a point? Timestamps in DBs should never be non-UTC...
[postgres] # create table test(ts timestamp);
Time: 29.807 ms
[postgres] # insert into test values('2019-05-03 17:51:00');
INSERT 0 1
Time: 6.637 ms
[postgres] # select * from test;
[postgres] # select ts at time zone 'utc' from test;
Time: 17.506 ms
[postgres] # select cast(ts as timestamptz) from test;
Time: 0.428 ms
[postgres] # select cast(ts as timestamptz) at time zone 'utc' from test;
Time: 0.329 ms
[postgres] # alter table test alter column ts type timestamptz;
Time: 14.477 ms
[postgres] # select * from test;
Time: 0.399 ms
[postgres] # set timezone = 'America/Los_Angeles';
Time: 9.281 ms
[postgres] # select * from test;
Time: 0.364 ms
There was an article recently, describing a very important, and also very common, "exception":
If you want to schedule/synchronize real-world appointments/events, you need both the _local time_ and the political zone, like 2018-12-11 9am at Europe/London.
The reason is that between the time of entry and the time of the event, the defined timezone for that area may change. As a result, your previously translated appointment would no longer match the local time.
Which is bad, because humans, when meeting, or events, when being announced, announce the local time. Not UTC. And they stick to the local time.
It basically lets you fold all the logic of converting datetimes into UTC and back out of UTC to local time for display into the DB itself. Worst case, if you want UTC and to do it in the application logic, just set your connection timezone to UTC, and you'll be expected to provide and get back UTC times.
There's no need for constraints; that's exactly what timestamptz does. If it's provided a zoned datetime it'll convert that to UTC and store it; if provided a non-zoned datetime it'll assume it's in the local tz (the "timezone" system setting) and perform the same conversion.
timestamp, by comparison, will just discard any timezone information (if it's given a zoned timestamp it'll simply discard the timezone information and store the timestamp's value as-is).
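A sketch of those two behaviors side by side (assuming the session's "timezone" setting is 'America/Los_Angeles'):

    SET timezone = 'America/Los_Angeles';
    SELECT '2019-05-03 17:51:00+02'::timestamptz;  -- converted: 2019-05-03 08:51:00-07
    SELECT '2019-05-03 17:51:00'::timestamptz;     -- assumed local: 2019-05-03 17:51:00-07
    SELECT '2019-05-03 17:51:00+02'::timestamp;    -- zone discarded: 2019-05-03 17:51:00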
> Timestamps in DBs should never be non-UTC…
That only works for past events. Storing future events in UTC is usually broken, because future events are normally relative to a specific timezone (e.g. a meeting at 6 in Tulsa, or a party at 8 in Melbourne). If you store them in UTC and the timezone's offsets change, your occurrence is now incorrect.
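One common way around that (a sketch, with invented names) is to store the wall-clock time and the zone separately, computing the instant at query time so that later tzdata changes get picked up:

    CREATE TABLE events (
        local_time timestamp NOT NULL,  -- wall clock, e.g. '2025-06-01 18:00:00'
        tz         text      NOT NULL   -- e.g. 'America/Chicago'
    );

    -- interpret the wall-clock value in its zone, yielding a timestamptz:
    SELECT local_time AT TIME ZONE tz AS instant FROM events;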
I think you might want to consider why some people might actually care to store timezone data, and why it almost always is needed. Timezone data is separate information that if not included, is now missing from your data forever unless someone goes and backfills it in. Without the explicit timezone stored, your timestamp could technically be in any timezone as far as the database is concerned.
> I guess one could use timestamptz with a constraint saying that timezone is UTC, but is there a point?
The point is you now have a data type that can properly use the built-in functions to more easily run calculations against proper timestamps that do have timezone sensitive calculations.
As the article points out, confusingly, problematically, and despite its name, "TIMESTAMP WITH TIMEZONE" doesn't, in fact, store timezones.
What it does is implicitly convert between UTC (the storage) and zoned timestamps (anything you give it, either explicitly zoned or, if non-zoned, assumed to be in the server's "timezone" setting).
Thankfully TIMESTAMP WITHOUT TIMEZONE doesn't store timezones either (how weird would that be), what it does is discard timezone information entirely on input, storing only the timestamp as-is: if you give it '1932-11-02 10:34:05+08' it's going to store '1932-11-02 10:34:05' having just stripped out the timezone information without doing any adjustment whereas timestamptz will store '1932-11-02 02:34:05', without maintaining any offset either but having adjusted to UTC:
# select '1932-11-02 10:34:05+08'::timestamp, '1932-11-02 10:34:05+08'::timestamptz;
      timestamp      |       timestamptz
---------------------+------------------------
 1932-11-02 10:34:05 | 1932-11-02 02:34:05+00
Time: 0.466 ms
On that note, though, I wonder what impact (if any) this has on an index covering the column? I could be off here, but from my MySQL days I was under the impression the index would be more efficient if there was a length limit on the column.
Warnings (necessary and useful for quicker feedback loops) only tell you not to do something.
When would json be preferable to jsonb? I can imagine cases where write-speed is essential and fancier queries and indices aren't needed, but that seems rare.
My current understanding is that json is faster and bigger than jsonb.
FYI, it's the other way around. When you use the json type, Postgres validates it and stores the raw string. Operations on json are very slow. Whenever you extract a field from a json column, Postgres has to parse the full json.
On the other hand, jsonb is stored in a custom binary format. This makes operations a lot faster, as Postgres does not have to parse the full json. There are also some other advantages, such as jsonb storing the keys in sorted order, which makes it possible to do a binary search to find a specific key in a lookup.
There are two primary downsides of jsonb. First, it doesn't preserve whitespace. This can be a problem for some use cases. Second, since there is a lot of metadata in jsonb, jsonb is larger. Postgres stores information like the offsets of every key and value.
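A quick illustration of those differences:

    SELECT '{"b": 2, "a": 1, "a": 3}'::json;
    -- {"b": 2, "a": 1, "a": 3}   (raw text preserved, duplicate key and all)

    SELECT '{"b": 2, "a": 1, "a": 3}'::jsonb;
    -- {"a": 3, "b": 2}           (keys sorted, last duplicate wins, whitespace normalized)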
"In three years, varchar no longer takes an argument, char and money do no longer exist, psql's password flag will be removed, all tables will be lowercased, BETWEEN will require fully qualified values, table inheritance will be disabled unless you specify a config value, and this page will be shortened to only include NOT IN and SQL_ASCII. In one year, a lot of these things will raise warnings. In two years, it will raise suppressible errors."
(I left out a few for brevity, but only a few -- this would actually resolve most of this page.)
Most of this strikes me as common things (varchar and psql --password in particular), and aside from NOT IN there appear to be easy solutions to make people stop using it without big changes. It's not like python3 where you have to change a core thing like text handling, it's removing a flag from your cronjobs (--password), omitting a value (varchar(n)), or telling the database what you really meant (BETWEEN X AND '2019-01-01' -> '2019-01-01T23:59:59.999'). For stuff without obvious fixes, a "do not do this" page is much more reasonable.
Just look at the python 2 to python 3 transition, which is still ongoing, more than 10 years after the python 3.0 release.
Also, warnings usually simply don't work. Developers or admins google the warning, see that it can be switched off, and switch it off. DDL is often generated (for example for migrations), and you might not want to add a special case for postgres if you can avoid it. Suppressing a warning is definitively easier than changing code generation in potentially non-compatible ways.
Plus, I haven't looked it up, but given how prevalent VARCHAR(N) is, it might be part of the SQL standard.
And yet, yes! The best practice, obviously, is to always use Postgres. So --- what other flavors of SQL?
More seriously, I have read that SQLite is often similar to Postgres. Its main author actually has said in a talk that when making decisions he asks, "What would Postgres do?" Also I have heard that Postgres is similar to Oracle, so much that you could move from Oracle to Postgres with a thin layer of compatibility. In fact I think that's what https://www.enterprisedb.com/ is.
Minor nit: Pg doesn't have SQL-standard stored procedures, a.k.a. SQL/PSM. MySQL and DB2 do, however.
Looks like there is an extension, but it hasn't been updated in 8 years and doesn't appear to be production-ready yet.
The various timestamp-related data types are also very PostgreSQL-specific.
So I think there's some good high-level advice here that applies across multiple flavors, but the specifics are really only applicable to PostgreSQL.
No, anything like this is pretty implementation specific.
In "Table Inheritance" the article is referring to a postgres specific feature which implements a form of the data pattern by the same name. The pattern is not unique to circumstances where you are working with an ORM. So if your data is conveniently modelled using such a pattern by all means use it, in whatever DBMS you are using, but perhaps not with this postgres specific feature.
In "Not In" the fragment "does not optimize very well" is dependent on the query planner. Other DBMS engines may not optimise this well, but you might want to run performance tests on realistic data to check. The possible confusion when the list contains one or more NULLs is present in anything standard because that is a property of NULL handling.
The comments about case are implementation-specific too. In SQL Server, for instance, it depends upon the collation set for the database, with names being consistently case-sensitive or consistently not (usually the latter) depending on the collation setting and irrespective of any name escaping.
The point about using BETWEEN with date+time values is correct generally, and I have seen it cause confusion a number of times. Though I've seen similar confusion with other comparisons based on date for date+time types too so this isn't specific to BETWEEN.
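For illustration (with invented names), the usual fix is a half-open range rather than BETWEEN, so times on the last day aren't silently excluded:

    SELECT * FROM orders
    WHERE created_at >= '2019-01-01'
      AND created_at <  '2019-02-01';  -- all of January, nothing more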
The comments about time types with timezone information are general to databases that support such types (though not all common DBs do). My advice would instead be to always store times in UTC and translate to the local timezone elsewhere (though this can have its own problems so others may disagree).
String data types and their performance characteristics vary a lot between databases. In SQL Server a VARCHAR(MAX) value (the equivalent of postgres' VARCHAR-with-no-length or TEXT) will usually be stored off-page, which can have negative performance implications, and it can't be indexed, so you want to use limited-length text columns unless you definitely need long values. What SQL Server calls TEXT is essentially a BLOB type and long since deprecated anyway.
> I can imagine it getting overwhelming when interviewing for jobs that nitpick specific code using specific implementations of SQL
If the job is specifically working with that flavour of database then this sort of detail might come up; there is little to nothing you as an interviewee can do about that other than swot up beforehand if you are used to a different engine. Though such detailed nitpicking probably doesn't belong in a good interview process anyway, at least not until late stages when there are too many otherwise excellent candidates and they need something/anything to differentiate between them to make the choice easier!
If you are interviewing for a job that involves coding for multiple SQL database flavours then you will need to at least appreciate that the characteristics and behaviours of data types vary between engines, even if you don't know all the specifics off the top of your head. Other oddities you might come across include Oracle storing empty strings as NULL (with the possible problems at retrieval time this implies), SQL server ignoring trailing spaces in strings most of the time, and so forth.
That's fine for, eg, recording a timestamp, but doesn't work for a scheduled event. What if you want to be able to create a new event for "same time next week"? Impossible without having the time zone.
I had an app where it mattered so we stored the datetime with a timestamp (as UTC) and had a separate column for timezone.
I don't really see the point of timestamptz. In most cases, when you're just storing a moment in time, convert it on the frontend and store it as UTC. If the time zone actually matters (scheduled, future events) then timestamptz doesn't cut it, and you need to store the timezone separately anyway, so you might as well just use a UTC timestamp.
Eh, I guess if your programming language doesn't have good time zone support you could lean on postgres for the conversion to UTC, but in most apps you'll have to deal with time and zones on the frontend anyway.
Timestamptz column values are the UTC timestamp, but the returned time-zone-aware result will depend on your DB settings, or even the local time of your client. Which I think is what everyone would want.
Storing an extra column time zone with a timestamp is just an inconvenient way to implement timestamptz.
A standard timestamp datatype, always stored in UTC (or with the specific timezone in another column), is the best way to deal with dates.
For money amounts, use NUMERIC(19,4)
Note: if you ever find yourself dealing with cryptocurrencies, you will probably want at least 18 decimal places (BTC and lots of coins have 8, and ETH has 18).
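As a sketch of that advice (table and column names invented):

    CREATE TABLE payments (
        amount   numeric(19,4) NOT NULL,                     -- per the advice above
        currency text NOT NULL CHECK (length(currency) = 3)  -- ISO 4217 code
    );

    -- for crypto, something like numeric(38,18) covers ETH's 18 decimal places:
    ALTER TABLE payments ALTER COLUMN amount TYPE numeric(38,18);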
Good to know that I don't need to calculate the size of my varchars, though.
I like to compare them to bytes and characters. You would use bytes if you need bytes, and you would use characters if you need characters. They are unrelated data types unless a character encoding is specified. Likewise, instants and locals are unrelated data types unless a time zone is specified.
BTW, the manual page links to https://it.toolbox.com/blogs/josh-berkus/zone-of-misundersta... which explains that "with time zone" doesn't store a time zone.
Java's Instant is a timestamp in UTC, but it's all too easy to make mistakes in serialization, as the timezone information is just not present. I spent a few hours fixing exactly that yesterday at my day job.
The page you link says that with timezone makes sure to convert to and from UTC, which is just a transparent optimizing implementation detail. If you don't do timezone conversions at DB level, you have to make sure your applications do, which is a recipe for disaster.
If I set an alarm on my phone for 6am, I want it to go off at 6am in whatever time zone I happen to be in, even if I travel after setting the alarm.
However it's not a postgres "timestamp with timezone", which (as both TFA and fphilipe's article note) doesn't store a timezone.
The problem of "future event in a local place" is why the "store everything as UTC" trend is wrong, incidentally: if the timezone's offset changes (which happens regularly, sometimes with very little warning) the stored time is now completely off, by minutes to days depending on the change. "Store everything as UTC" is only suitable if you're only ever storing past events (timezones don't normally get retroactively altered).
Basically, a timestamp should be tied to a timezone or a location, or at worst an indirection (like you being at a specific location). And since we don't tend to have practical support for tying timestamps to locations, we tie them to timezones instead.
IMO, in a database you should store the "natural" representation of date-time as your users think of it. Sometimes (like your alarm clock), it's timezone-less concept. Sometimes (like a sporting event) it's a combination of datetime + location. Sometimes (like historical events) that's a fixed instant in time. Recognize which is which and model appropriately.
I can imagine Walmart or Ikea saying "We open at 8 am worldwide".
Any use of the word time above actually meant datetime.
Obviously a new system shouldn't do this, but old data is messy.
The confusing thing about TIMESTAMP WITH TIME ZONE is that it doesn't actually store a timezone! It simply converts input with timezone info to a global instant of time.
If you actually need the original time zone preserved, you need to store it separately.
Also, datetimes far enough in the future or the past may be a case where a timezone is incorrect. The deciding factor is whether you're able to make a true statement.
If you're using a DBMS as a dumb data container, and there are plenty of good reasons to do so, then it may also make sense not to use timezones. If you're doing intermediate calculations or sharing with systems that don't support timezones, those values often shouldn't have timezones. That's also why all datetime libraries support times sans timezone.
Setting up the constraints is a tiny bit more work, but it's worth it in the end.
It's especially nice when your ORM or web framework knows to deal with all of this at both the migration and validation levels. Then you can have both database level protection and nice human readable errors in your app.
1) Don't use timestamp (without time zone)
2) Don't use varchar(n) by default - just use varchar or text without any length specified
3) Don't use money - The money data type isn't actually very good for storing monetary values. Numeric, or (rarely) integer may be better.
The problem with this one is both timestamp and timestamptz are problematic:
* timestamp will implicitly discard input timezone information
* timestamptz doesn't store a timezone; it converts everything to UTC and stores that. Furthermore, it assumes any data without timezone information is in the "timezone" system parameter's zone, and converts from that on input and to that on output
Storing everything as UTC is problematic because it's completely unreliable for all future events e.g. it's plain broken for calendaring and reminders.
If you are going to have different currencies I would almost always go for integer types (I don't understand the "rarely" advice here), as the number of fractional digits may vary greatly between currencies. It's better to derive the fraction from the currency.
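A sketch of the integer approach (names invented): store minor units and derive the exponent from the currency elsewhere:

    CREATE TABLE balances (
        amount_minor bigint NOT NULL,  -- 12345 = 123.45 USD (exponent 2) or 12345 JPY (exponent 0)
        currency     text   NOT NULL   -- ISO 4217 code; minor-unit exponent looked up from it
    );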
$ perl -wE 'say scalar localtime 2177449200'
Sat Jan 1 00:00:00 2039
> select FROM_UNIXTIME(2177449200);
+---------------------------+
| FROM_UNIXTIME(2177449200) |
+---------------------------+
| NULL                      |
+---------------------------+
1 row in set (0.000 sec)
To the topic of the post: if anyone knows of a succinct list similar to this for MySQL, that would be great. I tried to search for something after reading this, but wasn't having much success.
To make matters worse, there were no indexes on anything other than on primary keys.
They should really deprecate all other forms. I’ve wasted time reading up on this feature several times, always to conclude there is a better solution to my problem.
SQL is much less forgiving. Database migrations are terrifying, but the longer you put it off the longer migrating will take (since new rows will continue to accumulate).
Perhaps the solution is "plan out your data structures ahead of time", but given that we're human, I think it's really difficult to do so correctly on the first try. For example, you may have an int column to keep track of the number of "Likes" an item on your service gets, but it may not be until later that you realize you also need to keep track of who liked what in order to prevent users from liking something twice. This is an elementary example, but the point is that when you design tables, especially if you design them around the frontend, you end up leaving out some critical columns that the backend needs to implement its features. Whenever you discover this, you have to not only add that column to a new table but also retroactively insert every existing row into the new table. At best this is O(n), but if you're using a database there's a good chance your n will be quite large.
You could argue that maybe in the startup stage your n will still be small, since you won't have a billion users off the bat, but I would counter that you usually don't realize what you need to change until someone else breaks something (assuming you have tests for your code already) or requests something new, at which point you likely have an established user pool.
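To make the "Likes" example concrete (a hedged sketch, all names invented), the fix is usually a new table plus a backfill, and the backfill is exactly the O(n) part:

    -- new table tracking who liked what; the PK prevents double-liking
    CREATE TABLE item_likes (
        item_id bigint NOT NULL,
        user_id bigint NOT NULL,
        PRIMARY KEY (item_id, user_id)
    );

    -- the old like_count column can't be truly backfilled (nobody recorded who liked what),
    -- so it typically lives on as a denormalized counter kept in sync with the new table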
This is, IMHO, not a good reason to use NoSQL.
1) Don't touch the old data, and make sure all your code knows how to handle both the new and old layouts
2) Update the old data to comply with the new layout
#1 is cheaper in the moment, but makes your software more fragile because of the special cases: you now have two codepaths to check for every change instead of one. Also, whatever improvements the new layout is supposed to provide aren't available to the old records.
#2 is more expensive right now, and scarier because your migration will be touching everything at once. If it starts trashing data, it'll trash all of it before you have a chance to stop it. On the other hand, after you've successfully migrated, all your data is consistent and you don't need to forever deal with legacy records in the active codebase.
Both approaches can be used with either SQL or NoSQL datastores, and there's no reason you have to pick the same strategy for every schema change. There's pros and cons to both approaches and which one is better will depend on your particular situation.
I work at a large company, and I've often seen SQL using this idiom, even in code written on DE teams.
A lot of this stuff just needs to be deprecated (case-sensitive tables, password flag...) or come with (bigger?) warning labels (between query, table inheritance...).
This is one case where MySQL seems a bit better.
Note if I could I’d change to lowercase underscore naming conventions, but it’s impossible to change now.
create table Foo ( Bar serial );
create table "foo" ( "bar" serial );
create table "Foo" ( "Bar" serial );
I think the average person can figure it out.
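Spelled out, with default settings assumed:

    create table Foo ( Bar serial );     -- unquoted identifiers fold to lowercase: creates foo(bar)
    create table "foo" ( "bar" serial ); -- identical to the above, just explicit
    create table "Foo" ( "Bar" serial ); -- a distinct, case-sensitive table that must be double-quoted in every later query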
> Don't Do This: Mistakes we've made in licensing
> Don't Do This: A case for considering MongoDB
> Don't Do This: Handbook for social behavior at events & conferences
There are always deprecation notices, deprecation flags, and "not giving a shit that somebody's 20 year old codebase breaks on today's version of $project".
"Bad practices" often don't start out as bad practices; circumstances change. Additionally, some "bad" things are more like "dangerous if improperly handled" things (e.g. Rules), so making them impossible cripples users.
Lastly, there exist some useful, maintained libraries/tools with a substantial proportion (or majority) of "client" codebases at or greater than 20 years of age. Just because something has been around for a long time is not an excuse to break it. Just because something has not been maintained does not mean it is necessarily poor engineering or deserves to suffer for the passage of time.
either don't upgrade or fork away, don't make others suffer for your unwillingness to change.
The Rules example is mentioned often. I suggest not taking everything written on the internet so literally; or is it really impossible to convey sarcasm without /s?
A better example commenters could focus on is the type `money`: if the best advice from postgres about handling money is to not use the `money` type, it shouldn't be part of postgres. How is that in any way controversial?
And you should always be able to upgrade the software you have without fear of breakage, because maintaining a fork and/or deciding which versions to use is often untenable as it relies on specialized knowledge which may not exist in your organization. If you as a software dev want to maintain a clean code base with no deprecated features, you are free to build one. That's not what this is, and you shouldn't expect anyone else to use it if it were.
It's not 'common sense' or 'sanity' to break existing code bases. Companies and individuals have generally spent millions on their software, and breaking changes cost time and money. There's a reason that Python 3, for example, took over a decade to be fully adopted. Breaking changes should only be implemented with considerable evaluation of the actual need and of the existing options. If users cannot rely on you not to break their systems, they will move on to someone they can rely on.
"either don't upgrade or fork away, don't make others suffer for your unwillingness to change."
I manage a number of databases, mostly of a financial nature, and I can tell you that if I were using a database system developed the way you suggest I'd be in trouble. On the one hand, I have to keep my data secure, so I have to patch my systems on a regular basis, which means I can't use a version of a database system that isn't being actively updated. On the other hand, I have to manage hundreds of thousands of lines of mission-critical code, most of which I didn't write, and a large amount of data which absolutely, under no circumstances, can be lost, if I want to keep my job. To that end, I can't risk breaking changes and usually wouldn't have the bandwidth to accommodate them if I could. There's a reason that many big companies are still using mainframes despite there being better options: rule number one as a system administrator is to not break the existing system.
"a better example commenters could focus on is the type `money` - if best advice from postgres about handling money is to not use `money` type - it shouldn't be part of postgres. how is that in any way controversial?"
Let's actually take this example seriously. Let's say that there is an off-the-shelf accounting system that uses PostgreSQL and, because of bad decisions made in the past, uses the 'money' datatype internally. Presumably they've worked around its limitations so it's no longer an issue for them. Let's say that PostgreSQL decides to remove this datatype in the next release of the database. Well, suddenly not only does that accounting system no longer work, but every company that uses the software is SOL. In addition, any integrations and custom code that worked with that accounting system are now broken. This is a loss of millions of dollars in software dev time at the very least. There's also a possibility of data loss, which when talking about accounting systems is a matter of legal compliance. This is a much worse outcome than people accidentally using 'money' in new development in the future. Considering that the latter problem can be ameliorated with just a warning, that seems like the more obvious trade-off.
Have I suggested anywhere to suddenly make releases with features removed? Does everybody in this thread have a mental blocker on the word "deprecation" or something?
There are ways to remove features properly and gradually: deprecation warnings, hiding the feature behind a deprecation flag, moving the feature out into a plugin, etc.
I honestly wasn't expecting to have to respond to such strawman comments.
Yes, you clearly did:
"legacy code and backwards compatibility are dumb excuses to not cleanup the codebase and make bad practices like those listed impossible."
"...it shouldn't be part of postgres..."
""not giving a shit that somebody's 20 year old codebase breaks on today's version of $project"."
(Disclaimer: I picked between the two at random and only continue to use the thing I picked because I now know it well and it is adequate for my needs)