Weird, I really dislike a lot of the suggestions. In particular, an example:
SELECT first_name AS fn
FROM staff AS s1
JOIN students AS s2
ON s2.mentor_id = s1.staff_num;
We already agreed that staff is a good name for a table, so why are we renaming it to s1? There are all sorts of subtle bugs that arise when s2.mentor_id = s1.staff_num is wrong, and the variable names provide no help here that we're doing things right.
How about:
SELECT first_name
FROM staff
JOIN students
ON student.staff_id = staff.id;
If you're reading some 50+ line SQL query with a bug, things like this are easy to read and be confident in. Honestly, I think AS should basically never be used for tables.
Likewise, the guide gives the example of naming join tables 'services' instead of 'car_mechanics'. But if I see a table 'car', a table 'mechanic', and a table 'car_mechanic', I instantly know how to join against this in any direction. The hard part of understanding a schema is how it fits together, not what it represents, and when you have hundreds of tables and need to remember that join table's name, it's really, really nice if it's trivial to derive.
Lastly, the guide's recommended indentation for FROM vs JOIN statements seems off to me. Consider:
FROM riders AS r
INNER JOIN bikes AS b
ON r.bike_vin_num = b.vin_num
AND b.engines > 2
Well, the thing is, the riders table and bikes table have the exact same priority in the query, but this spacing really emphasizes the riders table and it's not clear why.
I only slightly disagree with a couple of your points.
1) They aliased it to show how it should look. This is a contrived example, and you normally wouldn't alias these.
2) Aliasing tables is essential when you have join tables that end up being over 30 characters long, or you need to self join for something. Short of using CTEs (which aren't available everywhere), I don't know of a way to self join without using aliases.
> Aliasing tables is essential when you have join tables that end up being over 30 characters long, or you need to self join for something. Short of using CTEs (which aren't available everywhere), I don't know of a way to self join without using aliases.
This is 80-90% of my queries because we do a lot of cross schema joining in my work. Also there's a lot of tables replicated from the mainframe so the table names are T_<4 Letters that mean something to someone>_<Some more seemingly random letters>_<Something vaguely like a word if you're lucky> so their names aren't useful at all. Add in that a lot of queries are reporting queries so we're pulling in 20-30+ columns from these long named tables from different schemas and not aliasing them would add a a small novel worth of text to the query.
> Honestly, I think AS should basically never be used for tables.
I believe AS is required for self-joins. A common expository query is to find all employees who earn more than their managers, which can be expressed via "select name from staff as employee join staff as manager on (employee.manager_id = manager.id) where employee.salary > manager.salary".
Also, SQL allows a sub-select to be listed in the FROM clause. But in such cases, the sub-select must be given a name (tables have a name, but sub-selects don't).
As an aside, the keyword "AS" is allowed by not required.
Sure, I forgot the self-join case. In my experience these are a minority of joins, but definitely. In postgres at least, a lot of self-joins can be solved with a DISTINCT ON instead.
Sub-selection with a FROM clause, yeah, those definitely need names. But those don't have perfectly good names that everybody already knows, whereas tables do. If you're maintaining somebody's code and they declare
foo_bar = 10
I think it'd be pretty strange to ever consider writing
fb = foo_bar
just because you find foo_bar too long to write or something.
why is "tb_" bad for tables? for some reason this has become standard practice for me. at least on SQL Server it can help distinguish tables from views in complex queries or stored procedures. No idea where i learned this habit.
Aliasing tables is very useful. Aside from giving shorter names and helping you to be explicit when selecting fields, they also end up being useful for working with IntelliSense (or the equivalent in the tool you're using). Plus, you can make them memorable enough (e.g. in the first example you shared, could use 'S' for staff and 'ST' for students) for the query you're working on and shortening the query code makes it more readable. For all the reasons above, aliases are one of the first things I'll add when writing a new SQL query.
As for the indentation after FROM, I prefer to put an indent here too. Whilst you're right that INNER JOIN would have the same priority, LEFT JOIN wouldn't, and I wouldn't want to use different indenting styles for different types of joins.
That said, I don't agree with everything in this style guide, especially...
* Do... "Try to use only standard SQL functions instead of vendor specific functions for reasons of portability."
* Avoid... "Descriptive prefixes or Hungarian notation such as sp_ or tbl."
With the standard functions, it strikes me that's optimising the wrong thing. How many times are you likely to switch database engines? Such actions are very rare, especially for teams with good grounding in RDBMS'. What's more important to me is readability of the code, and there's plenty of useful non-standard SQL for improving readability. If we want both we should be pushing for updated standards to be applied across the board.
As for the second point, it's useful to know whether you're looking at a table/view/function/stored procedure/trigger from the name alone. Whilst the tbl prefix seems superfluous, other database objects should have a descriptive prefix in the name IMO.
I'm aware of that. However, it only has to be memorable for the period of time you're working with a query. If you open up an old view or stored procedure to modify it, you would read the table aliases first. If someone wasn't able to retain the meaning of 'S' and 'ST' for the length of time needed to modify a query then perhaps working with code isn't for them.
Lastly, if someone did have poor memory they could lengthen the aliases to STA and STU.
Better :) But I prefer to just give my tables short names to begin with. How about staff and pupils? That's so short that there is little temptation to abbreviate.
> How many times are you likely to switch database engines?
Not often, but you may wish to support multiple engines at once, and it makes your code a lot easier to manage if you don't have to have a separate set of queries for each engine.
I really agree that the recommendation on aliasing is just plain wrong. I prefer using singular names relating to the role-in-the-query, e.g.:
SELECT manager.last_name AS manager_name,
employee.last_name AS employee_name
FROM staff as employee
JOIN staff as manager
ON employee.manager_id = manager.staff_id
This is probably most important (as in the above example) with self-joins, but I think its a good practice more generally, and makes queries more self-documenting as to intent.
SQL isn't BASIC, and we aren't limited to x (1,2,4, etc) character variable names. Why not just use the existing meaningful name? They aren't that hard to type, and you might even be able to use autocompletion, depending on environment.
I've spent far too much time re-writing queries to remove obfuscating aliases to accept a blanket statement that they make it easier to get your head around the code.
In my line of work I often end up working with queries where you'll have a handful of JOINs and need to SELECT a dozen or more columns. In those cases writing out staff.foo or students.bar gets old really fast.
I agree that s1 and s2 are bad though, as I said in another comment.
>In my line of work I often end up working with queries where you'll have a handful of JOINs and need to SELECT a dozen or more columns. In those cases writing out staff.foo or students.bar gets old really fast.
It's better to have it "get old" though, than to introduce subtle bugs because of similar, short, aliases.
Longer Names have there own issues and remember not every one can recall long strings. dyslexics and other neurotypical people often have difficulty with this
There is no way you can port SQL Server to MySQL to Oracle to PLSQL without rewriting virtually every single query. They're all too different.
So why bother? Write idiomatic code that other programmers used to that engine will understand.
Not only that but different engines like different things. That blazingly fast nested subquery in MS SQL will become a massive performance problem in MySQL.
There is no such thing as portable SQL, don't pretend there is.
Not only that, often the standardised way is worse than the vendor specific way (e.g. INFORMATION_SCHEMAS vs built in sp_ from MS).
I completely agree with this. One should never plan to change database engines -- it just happens so rarely and it's such a big deal that planning for it is a waste of time.
It would be like coding in Java in such a way to make porting to C# easier in the future (or vice-versa).
It's highly unlikely you could ever use generic enough SQL to go from open source DB or Oracle. Best case you wouldn't be getting the best performance out of your open source database and then you wouldn't be getting the best performance out of Oracle.
It's better to build for the platform you have than build for a mythical common platform that contains the minimum features of syntax of all of other DBMS (which probably doesn't exist). If you change DB platforms, you're just going to have to rebuild.
Supporting SQL Server, MySQL, Oracle, Postgres, DB2 and some others with standard SQL on a large Enterprise codebase here.
Works just fine.
Could we override anything that needs better performance with custom queries ? Sure, and we occasionally do, but for 95% of use-cases, it's irrelevant.
Of course, we'd love to just focus on one database, but don't get confused in this day of cloud-everything that 'suitable' performance for very large companies isn't possible.
It's a different matter when you're approaching from a "support everything" perspective.
And to do that, I assume you're handcrafting all your queries, have to teach every developer how to do that in a supported cross-engine way and are severely limited in the set of functionality that you can actually use.
A simple example is that both My SQL and SQL Server's tools love adding ` an [ to their auto-generated scripts, something you simply could not tolerate.
But if I want to add a particular constraint to SQL, I can do it using the GUI in management studio, click the script option, and viola, a portable script without any effort on my part to remember the arguments, syntax, etc. of a fairly obscure command that you might write a couple of times a year.
Try - in this context means where it doesn't make sense (for performance, readability, etc reasons) then don't follow the guide. Follow it where possible and be mindful of when you deviate that you're adding to tech debt.
I don't know if you've ever worked with MS SQL Server in production but for us it's been the most amazing piece of technology in our stack for the past 15 years.
SQL Server is in the top 3 best things to be published by Microsoft ever. (The other 2 are probably Excel and Flight Simulator)
and SQL Server BI stack, is very solid and cost effective
SSRS is really good
SSIS is very acceptable (I mainly use to call SQL Tasks)
and SSAS is really second to none
SSRS still doesn't have support for msbuild, SSDT is a goddamned dumpster fire, and SSIS jobs have that hilariously over designed graphical programming language and are do frustratingly brittle that they are functionally single-use. And SSMS has the worst use of tabbed files I've ever seen.
There's a link to like about mssql, but a lot to hate too.
I never get the fussing with aligning. I get the importance of clean indenting, but so much SQL I see looks like the writer spent a long time tweaking the alignment with spaces to make sure the data types lined up perfectly or the ON clauses or what have you.... And then had do do it all over again the moment they changed something.
Just indent when scope changes, and don't align things. Alignment is a time-sink.
I always think of the DB, as the last castle of a program.
When everything else fails, the DB should function.
Programs are maintained way more than they're written.
No matter how hard I try to keep my code-base clean, it gets cluttered.
That's why I strive to create a powerful base structure and make them bulletproof.
Whatever may come (depression, personal problems, fatigue, stress, deadlines, etc.)
this skeleton keeps the app alive.
In my experience code that is written now, will stay and that complete rewrite I've been dreaming about will never happen :D.
So IMHO it's better to design the core funcionality exhaustively and try to create a failproof code-base.
My intellectual capacity fluctuates and I'm never as productive as I want to be. I need a safety rope for my dumber version, who has forgotten about the code and has no idea where to start :)
Indenting, and aligning helps a lot, when I try to load the code back into my memory.
Though mostly good advice, I definitely disagree with this:
DONT:
Plurals—use the more natural collective term where possible instead. For
example staff instead of employees or people instead of individuals.
I like naming my tables as plurals so that foreign keys to the table rows can have a name that relates to the table name. For instance, having column `Orders.employee` as an FK to an `Employees` record makes much more sense than having column `Orders.employee` as an FK to a `Staff` record.
The debate about singular/plural naming for relations comes up frequently. In my opinion, it stems from the fact that English has only 2 forms for all the possible grammatical cases. It has only a different form for the singular and plural cases, and thats what people concentrate on. I propose to take a step back.
There are quite some languages that have different forms for the nominative, accussative, dative and genetive case. Lets take Latin and the concept of an unpaid developer (servus).
Servus is nominative singular. Lets make it plural: servi.
Select from servi...
oh wait, we have a from so we need an ablative plural here. So we get:
select from servis...
What about updating? Then we need a plural accusative form:
update servos...
But are we updating the whole collection or just a single record? Then we should probably talk about servum.
Et cetera ad infinitum.
This might be a little overstretched.. Why don't we just use the normal form (the style guide talks about natural) of a noun? Good question, that is the singular nominative form. So, why make a special case for plural and ignore al other grammatical concerns?
Honestly, if Latin-speakers had invented high-level programming languages (and, especially, languages intended for the kind of use and audience that motivated the design of SQL), they'd probably just include noun declensions as part of the declaration, and then decline the noun properly for the way they were used, rather than settling on single form.
Yes, that's not something we'd do with Latin declensions in a language designed by English speakers, but that isn't really an argument against naming collections with plural nouns in such a language, just because doing what might be natural in a language designed by Latin speakers to resemble Latin doesn't make sense to us.
It makes me curious if you apply the same rule to variable names in (other) programming, when they contain multiple items of a kind (e.g. arrays, lists, sets and so on).
I always trying to use singular form. I didn't give it much thought, I just don't like repetition, and all those "-s" are exactly repetition (not even saying about non-trivial things like "man-men", etc). The only place where it bit me every time is table for users. "user" apparently is a reserved word and I can't use it as a table name (without quoting). Otherwise I liked my approach.
Honestly, they both look fine, but they do impart different meanings. "Staff" implies a set of employees dedicated to something like a company or project whereas "employees" could refer to multiple employees in any context.
If you're walking on to a new project, these subtleties make all the difference in understanding what's going on without having to rely on documentation.
In my case, we have anywhere from 3 to 20+ types of employees, and "staff" is the word commonly used to refer to one of those classifications. Certain information requires different levels of protection for some of those classifications. While people can transition between classifications, it's not a very common occurrence, and it only makes handling their information more difficult (and the old classification may apply to the information for a specific period of time after the change, or indefinitely for some types of data).
I basically end up with a "people" table which holds a very small amount of common data and is used in a lot of relationships.
Honestly, the distinctions require a great deal of documentation to fully understand. However, making the data less normalized and a little harder to comprehend also protects new people from making honest mistakes which could lead to legal issues.
I have often seen commas placed at the start of each line when columns are being listed. I'm a little surprised this guide doesn't follow that convention. It would fit nicely with the established pattern of lining everything up:
SELECT first_name
, last_name
, email_address
FROM users
Can anyone speculate why it's not done this way in the guide? I think the reasoning behind it was to make later modification easier because adding or removing the last field required modifying two lines.
This is a pretty commonly asked question. The answer is two fold:
1. putting the comma before doesn't actually make commenting out columns easier - it just moves the problem to the other end of the list. Just try commenting out `first_name` in your example and you'll see what I mean.
2. because normal convention would be for the comma to come immediately after an item in written English.
Re; #1, I also use comma-first indenting and I actually put each field in it's own line specifically for ease in commenting (first_name would have its own line)
I'm surprised too. I'm unaware of any disadvantages to doing it this way. I have always preferred this format myself. It makes things easier to comment out, it makes it so the clause never accidentally ends in a trailing comma before the next clause, and it looks nice.
I just think it looks visually ugly. I've heard others think that too, so it's not just me. I don't feel like the problem it solves is really that big of a deal. I can always just clone the second line up since order normally doesn't matter.
Yeah, basically it's a good practice to start field name rows with commas as it makes it quicker to comment them out as you're tweaking the SQL code (though there's no harm in putting multiple fields on a single row, especially if they come from related tables).
Besides the first reason pointed out by Treffynnon, it's worth remembering that the code we write is primarily for our fellow humans and that a piece of code will most likely be read more times than written so we better strive to optimize it for legibility.
If you're using Rails, your life will be much simpler if you do use an 'id' column, use plurals for table names, and name your join tables 'bars_foos', i.e. using the plural on both sides and in alphabetical order.
Don't really agree on indenting joins that are at the same level as FROM. There's not much semantic difference between the table in the FROM clause and one in a JOIN clause. Subqueries should be definitely be indented when nested.
Don't really agree with the leading whitespace alignment either. Alignment across lines is generally troublesome when combined with team development and version control: it leads to merge conflicts. If you start using 'GROUP BY' that forces you to introduce whitespace in your 'SELECT' line to bring everything into sync, then chances of a conflict increase quite a bit.
Don't 100% agree on using IN either, in particular if you're using a tuple membership check. That is, `JOIN x ON (x.a, x.b) IN ((1, 2), (2, 3), (4, 5))` certainly looks neater than the equivalent mix of ORs wrapped around ANDs, but it usually performs quite a bit worse because not many people write their SQL that way and the DB is not normally tuned for it.
A much bigger concern from my perspective (reviewing PRs from other developers) is writing SQL that will perform well for the database and data sizes. There are massive perf pitfalls around NOT EXISTS vs NOT IN, and it's easy to get horrific performance out of MySQL especially if using a subquery in the WHERE clause.
Structuring things as a join gets much more predictable performance, and structuring a set of joins as nested subqueries instead of joins at the same level (or using extra parentheses in Postgres) gives the SQL author more control over what will be done. Don't trust the optimizer to know best. It often doesn't and if it chooses wrong, it can hose a release. Measure, as always.
> Don't really agree with the leading whitespace alignment either. Alignment across lines is generally troublesome when combined with team development and version control: it leads to merge conflicts.
I'm definitely with you there. I've written about the disadvantages of this kind of column alignment several times on HN, so won't repeat them here. (Well, if anyone asks, I'll dig up a specific comment or two.)
I have a suggestion that can help anyone cure the temptation to column align all the things: Try coding in a proportional font for a while.
Contrary to popular wisdom, proportional fonts work fine for most kinds of programming. The only thing you can't do in them is column alignment, and in most cases this limitation is a good thing.
Instead of column alignment, you'll be forced to use indentation, and you may discover that indentation alone gives you enough ways to make your code structure clear without arbitrary alignment.
In my case, it happened the other way around. I got tired of the problems that column alignment caused, so I stopped using it and switched to indentation only. Some time later I was curious and tried a proportional font, and the code was still just as readable as before. So now I use proportional fonts all the time, and I'm no longer tempted to use column alignment.
My SQL style guide is quite a bit different from this:
* Use consistent and descriptive identifiers and names.
* Make judicious use of white space and indentation to make code easier to read.
* Try and use the same column/table naming convention as your programming language. If you language uses camelCase, use that. If it all possible, names should be the same everywhere to prevent confusion.
* Table names should always be in the singular: "Person" instead of "People", "Invoice" instead if "Invoices". This removes all pluralizing ambiguity and makes sense when mapping to class names, etc.
* Concatenate two table names together to create the name of a relationship table (especially in the case of many-to-many) unless a more appropriate name makes sense.
* Use TableNameId as the primary key column name.
---
* Do not use descriptive prefixes or Hungarian notation such as sp_ or tbl.
* Do not plurals -- don't even use collective names
* Avoid quoted identifiers
* Avoid underscores -- except for use a poor-mans namespacing
* Avoid simply using id as the primary identifier for the table.
* Avoid aliasing as much as possible unless joining the same table multiple times. In that case, use a descriptive name. Never use short one or two character aliases.
* Avoid putting keywords in all caps -- use an editor with syntax highlighting instead.
An interesting read! I do follow certain rules when writing SQL, and I agree that having them and following them is a good idea. Plenty of what is in the article looks like good advice.
However, these rules do not always appear to be consistent with how the code in (most) other programming languages is written. Consider indentation, for example. The usual approach is to line up those elements of the code which correspond to the same logical level of the flow, whether it is C++, Javascript, or Lisp. So rather than (quoting from the article)
SELECT file_hash
FROM file_system
WHERE file_name = '.vimrc'
I would rather see either
SELECT file_hash
FROM file_system
WHERE file_name = '.vimrc'
or
SELECT file_hash
FROM file_system
WHERE file_name = '.vimrc'
(The difference here is the same as between indented and non-indented braces in C.) I think this is especially useful when we have multiple joins or inner queries.
Many suggested rules are indeed consistent with what I have seen to be accepted as best practices. Such as naming a table in singular (more precisely, giving it the name of what a single row corresponds to), or avoiding the infamous Hungarian notation (pretty much an accepted best practice in most languages that I have seen).
Overall, I think, a good first step towards building best practices.
"Where possible avoid simply using id as the primary identifier for the table."
Has anyone had trouble by using surrogate primary keys? I've found the opposite of what the author said could be more true: composite keys should be avoided instead.
ON second reading I think you are correct. I agree on this point then, as it helps when joining through a denormalized table that references multiple PK ids, ex: I think:
user_id = role_id
is more readable than the likely alternative:
u.id = r.id
Generated unique tokens are better than auto increment IDs. They can be created in a distributed multi-master setup, and they don't expose how many of a given record type you have to 3rd parties when exposing primary keys or feeds via APIs.
I've had lots of problems with surrogate primary keys where people don't put enough constraints on the natural key and then duplicates (sometimes slightly different) get in.
Unless you're using innodb, primary key is not special, it's just another index. If you decide to use surrogate keys you also need to enforce the natural key too (if possible).
There are also some nice performance advantages you can get from natural keys with covering composite indices in the case where you only want the key. But in general a table can have multiple keys in all but the highest levels of normalisation.
That's a good point about manually defining constraints on a table. A composite key forces a developer to think about uniqueness whereas the surrogate key makes that thought process optional.
I disagree with this style point, and I think the author implicitly acknowledges that `id` should be the standard identifier, because further down he points out that the `_id` suffix should be used for columns that make reference to that identifier.
No he doesn't. In an example he actually uses the following:
ON s2.mentor_id = s1.staff_num
staff_num is the primary key
To me this isan anti pattern because:
- a primary should (can) never change;
- composite primary keys are only useful (save space) in very large datasets
It also conflicts with his pattern: uniform suffix _id
I am working with a 20 table db (not including lookups and session management tables) based on surrogate keys and it sucks ass for discoverability. It also subverts data checking based on SQL REFERENCES. I think natural / composite keys should used to design the db and establish relationships, surrogate keys added only if necessary, and then auto-generated, just like one waits to denormalize until there is a real need.
I suspect the love of surrogate keys is just more MySQL bullshit pretending that classic SQL design is wrong merely because it is not implemented in that particular broken database.
I disagree with this too, using table.id is easier to type and consistent when creating joins. It also encourages (forces) to reference the fields with tables prefix.
Maybe what author means is that from the db design standpoint, composite keys are used much less often than they could be. Very often unique table row can be defined by some other column or a pair, on which you will probably put an index anyway.
But that's just playing devil's advocate. In general I think ids are fine because ORMs play nice with them and FKs are simple.
Naming conventions are pretty controversial I think.
The databases I deal with are mostly for storing time series data taken from measurement instruments (i.e Load Cells, Thermocouples, Flow transmitters, Pressure gauges etc). We seem to use prefixes to distinguish columns i.e.
MS_ (Mass in Tonnes, MS_KG_ etc. Is used If not stored as Tonnes)
VL_ (volume, in cubic meters)
RT_FL_ (is Flow rate in normal cubic meters per hour)
TP_ (is temperature in degrees Celsius)
PR_ (is gauge pressure in KPa)
I've seem some databases which didn't use a prefix convention and in my experience it was a real nightmare trying to keep track of units etc.
There is an ISO standard for naming conventions ISO:11179 - My work's databases are (apparently) configured based on this but I'm sure the DBA's extended it in a bunch of ways.
I tend to prefer English suffixes and camelcase rather than Hungarian prefixes and underscores. Like, for a steel coil, InnerDiameterMM, MassTonnes, GaugeMM, etc. I prefer to leave abbreviations for cases where it's obvious even to non-experts, like mm for millimeters.
My only issue with that is you'd wind up with some very long column names.
Using prefixes you can encode a lot of information in a relatively short ID. Which may or may not be useful. As you mentioned though the downside is it requires some domain knowledge to decode.
A lot of our columns are in a form like this:
TP_TSC_HEX1_XXXXX
(where XXXX is some ID usually from scada system)
I can see from the name this is a temperature corresponding to thermocouple id: XXXX in TSC Heat exchanger 1.
Edit: I should also mention some programming languages (not sql) have a language feature for specifying named ranges (I think it is called lazy evaluation) which tends to favor prefixed names over suffixed - I.e x1,x2,...,xn to iterate over the prefixed range. We use SAS heavily at my work which supports using this syntax, which probably influenced the design of our DB tables.
Still dont get why sql uses all caps for keywords. It's one of the few languages that has it as a best practice. I can read C/JavaScript/go just fine without all caps.
You can ignore the boilerplate if its in CAPS. Quite a few DB engines enforce as a syntax error a SELECT that isn't in the precise order of FROM WHERE GROUP HAVING ORDER LIMIT (plus some others depending on your DB). You and I understand what we mean if we write
LIMIT 10
SELECT *
FROM wtf
but many DB engines will be unhappy because LIMIT has to be after FROM, etc.
Once you're used to the strict order rules, you can skip reading anything in CAPS as long as the bug you're fixing isn't that they're myspeled or order out of.
Personally I don't care either way, but I like it to be consistent for any given sequence SQL in a file (SQL is usually embedded in some other language's quoting mechanism).
If everything is lowercase, you can tune in to spotting the keywords. If it's mixed, it's just confusing.
There's an argument to be made that because SQL is so often quoted in a different language, highlighting isn't often applied.
I used to use all caps for keywords before the invention of syntax highlighting. Now my keywords are highlighted in another color just as they are in every other language so I no longer all-caps them.
Encoding. Needs a standard for encoding. "We are all UTF-8 here" As a parody of a bad design: this table is UTF-8 except for that column which is UTF-16 LE and that column of names is CJK and the street addresses column which is 7bit ASCII.
If you embed encoding into a design where the data changes like putting png files into a column named icon_gif, you are deserving of physical punishment. A better design is column binary_blob_28 should really have a sister column named binary_blob_28_mimetype. And don't get cute and use silly names for mimetype, if IANA's never heard of it you probably shouldn't be using it as a name.
Oh and while you're all unicoding for a good time, standardize that everything you write is run thru a normalizer as appropriate, or at least "you tried".
Another fun area not covered (or I missed it) is nothing is more hilarious than a timestamp without associated timezone. So is 1846 a good Central time to eat dinner, or a good UTC time to eat lunch in the midwest USA, or ... some DB support a surprising amount of TZ conversion although TZ always turns into a surprising amount of pain. "We're storing every time as UTC" isn't the worst idea ever.
Where possible ALWAYS use "id" as the name of the table primary identifier and its always a bigint... why burn infinite limited brain cell cycles trying to figure out the prikey of an arbitrary table "hmm well table employee uses MD5(social_security_or_equiv) which is a 16 byte char as its primary key but table source_code uses git hashes now is it the full 40 bytes or is 7 byte prefix good enough (usually is...). If you have to look up every foreign key or even worse, guess, or worse yet, guess and get it wrong, thats the last time your prikey won't be a bigint named id. Oh and as a corollary all your foreign keys will always be named "something_id" type bigint where "something" is the exact table name, not just close.
For the largest subset of equipment you'll interconnect with, figure out the best hardware supported hash and use it for non-security (de-duplication, for example) purposes. That may very well be md5, or maybe the lower 4 to 6 bytes of md5, or whatever.
Speaking of security the guide didn't go into it, so unless you're conferred with someone who knows the problem domain plus a little about security, assume your DB will downloaded next week and published on wikileaks ... don't do something stupid with sensitive data. If theres no reason to store sensitive data, then don't store it at all to begin with!
- If a field should never be NULL, put NOT NULL in the field definition. If a field should be unique, put UNIQUE in the field
definition. This improves both correctness and index performance.
- A useful format for CREATE TABLE is the one you get
back from SHOW CREATE TABLE.
emacs C-u M-x align-regexp can do this, if you give it a regexp matching the keywords you wish to right-align and enter -1 when prompted for "group to modifiy". It would still be a hassle though.
If it's talking about primary keys (which it seems to be), it should be unique, full stop.
If it's not, where does uniqueness come in? Is it actually talking about indices, which, if I understand correctly, should have a wide spread of values (without any of them necessarily being unique) if they're to be useful?
Technically, a key is any column (or a set of columns) that you use to find records. It does not need to be unique.
But it tends to be more useful if it is more selective, I guess that's why they say 'to some degree'.
So now, if a key is also unique, it is called a 'candidate key'.
In general, you can have more than one candidate key in a table. For example, a natural key (such as order#) and a surrogate key (e.g. 'id').
Among candidate keys, you pick one, and call it 'primary key'.
> Technically, a key is any column (or a set of columns) that you use to find records.
Isn't that an index rather than a key?
I thought 'key' meant:
- a superkey (any set of columns the values of which must be unique in any given row; in a properly relational model, any relation will have at least one, the trivial superkey, the set of all attributes in the relation)
- a candidate key (a minimal superkey, i.e. one without redundant columns -- if you remove any column from the key, it will cease to be unique)
- a primary key (the candidate key considered to be and designated as the most important)
I agree, the definition I provided (not mine) sounds more like an 'index' than a 'key'.
I am not familiar with that usage, I am used to 'key' being used as a logical concept, a constraint, while 'index' is a physical concept, some structure that is there 'just' to make shit go faster.
Anyways, some sources don't even define the word 'key', but use it to mean 'unique key'.
And some other sources (like the one by OP) use it to mean something else (totally not unique), as they recommend that it be 'unique to some degree'
And then some others use the definition I provided above. Is that the one meant by OP? I don't know.
I any case, I guess OP meant something like "indexes should be highly selective'.
Which BTW is a performance tuning advice, I am not sure that that belongs in 'SQL Style Guide'.
OK, so I think this might be a language/understanding difference here - perhaps what I have written isn't clear enough - not sure. The statement to some degree, in my opinion, doesn't imply that the key isn't actually unique. It still means unique.
What I meant is that if you don't have a unique column to make, say, a primary key with then you need to make it unique by adding (an)other column(s) to it in the key definition. Each additional column being a degree of difference/separation in my mind as I was writing it.
I hope that explanation makes sense and doesn't confuse the issue further.
I'd like to get your feedback on what you think it could say to make this clearer - I want to convey:
a. it must be unique.
b. you can have multi-column keys (some users - especially those tied to an ORM - don't actually know that you can do this).
Glad to see I conform to the guide in most places, but one thing I have been doing for nearly all 35 years of my programming career is to prefix my SQL views with `vw` so that I can tell immediately in my queries when I am referencing a view that may be made up of other queries.
The part about indentation is a bad practice to follow.
Indenting the code that way is not maintainable. I witnessed that multiple times in my career and adopted three simple rules:
1. Use only one space character between any two words.
2. Use only one tab (4 spaces) per indentation level.
3. Use no more than two blank lines to vertically separate parts of the code.
These three rules are simple, easy to remember and I'm applying them to any language I work with.
This is great, agree with many of the suggestions.
The worst SQL for my eyes is the stuff that has multiple columns, etc. on the same line. Makes diffing stuff 2x as hard and you can't parse as much information. Though maybe I'm just used to that at this point.
Then let's agree just to accept that there is no ideal, but that there is benefit in using the same formatting, and that perhaps we should create a sqlfmt utility like gofmt and all just use that.
The first one is just poorly stated but is correct in spirit. Use the most inclusive and flexible name allowed by the business logic without being overly generic. Some would say thats the natural term, I guess. The first example of using "employee" as a term invokes murphy's law that the company will hire their first contractor or intern the first week the software ships, resulting in much confusion, wait you've got a column for employees wheres the column for contractors? The second example guarantees that someone in production will confuse the individual serving ice cream cup production table with your HR list of employees if you use the overly generic word "individual". Or maybe those individuals are sales prospects, not employees.
The second one is simply wrong because it burns brain cells when you come back to debug or extend something a year later and nobody memorizes the prikey of table production_quality_results, is it the serial number of the mfgrd object or the timestamp of the QAQC inspection or the serial number of the inspection activity or ... and when you look at column names in table 131 is drivers_license_id a foreign key to a row in the drivers_license table or just a raw store of data, like this is just where you store it in the system? This is especially hilarious if your FK and data source are similar bigint type, like a bigint to connect a FK to a prikey or is your component serial number literally a bigint itself, assuming the prikey is the data itself without dereferencing it will be a hilarious bug, "it seems serial number 10 doesn't exist in our assembly line" "Whoops thats actually row 10 of the production table, serial number whatevs".
I am totally lost when it comes to your comments on the second item. It seems you're trying to say that surrogate keys are easier to find or perhaps guess.
The indexes set out against a table can generally be accessed with a query making it very easy to work out what existing index suits your use case best.
A guide is better than no guide. I like the part about adding suffix strings to variables. Two things I would changes
* Add commas before the variable
* Do not use single letters for aliases. Some versions of hive will throw exceptions. Plus seeing tables a, b, c, d joined together is annoying to read.
If you're writing a lot of ad-hoc queries, I prefer to have keywords on their own line and also put commas at the start of the following line vs end of the previous. It lets you comment stuff out easier as you experiment. In a production query the only hard rule should be, "Please be consistent!"
Putting the comma before doesn't actually make commenting out columns easier - it just moves the problem to the other end of the list. Just try commenting out the first column in your SQL and you'll see what I mean.
Trouble with this kind of thing is that if you make it general enough to apply to most situations you can't really say anything (too many exceptions) and if you get opinionated then your guide doesn't apply to a wide class of use cases.
Stuff:
1. Advice like "Use consistent and descriptive identifiers and names" - is general programming advice, not SQL style advice, and is so bland as to go without saying (would anyone seriously recommend being inconsistent or non-descriptive?)
2. Preferring standard SQL to vendor-specific SQL just isn't practical for nontrivial use cases.
3. "Use /.../-style comments where possible" - why? This makes commenting out large blocks of code really fiddly. Using double-dash comments, even for multiple-line comments, leaves /.../ usable for knocking out big blocks of code.
4. The advice on table naming - if a table contains one row per xyz, then calling the table xyzs and calling its id xyz_id is sound practice, arguably even if this does violence to the natural language in which xyz is a word. This reduces cognitive load on devs and makes the naming more predictable, especially bearing in mind that not everyone speaks the same first language.
5. No idea why "cars_mechanics" is worse than "services". "services" could mean practically anything.
6. There's a lot more to be said about table aliasing. Personally I prefer a, b, c... and then in derived tables, different groups of letters such as x, y... or p, q... . Stuff which requires an alias which will never be used (e.g. a derived table which is simply selected from) is aliased by _.
7. The guide says Hungarian prefixes are to be avoided, but a _num suffix denotes a number? Doesn't seem consistent.
8. All the advice on portability is unnecessarily opinionated. Portability is a concern that varies from completely unimportant to completely necessary, depending on what you're trying to achieve.
9. Indentation and layout - there isn't a "one size fits all" rule - it just comes down to judgement and taste. The given examples also look weird:
SELECT r.last_name
FROM riders AS r
INNER JOIN bikes AS b
ON r.bike_vin_num = b.vin_num
AND b.engines > 2
(a) Tables "riders" and "bikes" are siblings in this join so why are they indented to different levels?
(b) The AND isn't symmetric with respect to its operands.
IMHO this is better:
SELECT
r.last_name
FROM
riders AS r
INNER JOIN
bikes AS b
ON
r.bike_vin_num = b.vin_num
AND
b.engines > 2
Arguably, "b.engines > 2" should be in a WHERE clause instead of the ON clause, because it doesn't refer to table r. The query result is the same and the query planner probably produces an identical plan (depends on engine, of course), so this is genuinely just a style/semantics point.
10. "Avoid UNION" - why??
11. Avoid vendor-specific data types - this just isn't practical real-world advice. Sometimes you commit to a certain DB because you need to use its specific feature set. That's OK.
12. "Prefer NUMERIC and DECIMAL to REAL and FLOAT because rounding errors are annoying". This is completely determined by your specific requirements! Essentially this advice is saying "always prefer arbitrary-precision arithmetic". 32-bit and 64-bit floats are much faster and more compact than arbitrary-precision and are accurate enough for a very wide class of use cases. IMO this advice should be turned on its head: "use floating-point unless you know you need higher precision and you understand the performance/storage implications" - which would be the default approach in most programming scenarios.
13. "Avoid table partitioning" - this is a design/performance concern, not a style concern, and it depends entirely on use case.
14. "Avoid EAV, use a different product" - not real-world advice. Most devs don't have the luxury of just introducing different products when they run into an awkward scenario, and even if they did, introducing a whole new DB engine carries major overhead. Sometimes you just suck it up and have a small EAV wart in your design.
I think a lot of this comes down to what appears to be a simple misunderstanding.
Avoid - in this context means that where it does not make sense (for performance, readability, etc reasons) then don't follow the guide. Follow it where possible and be mindful of when you deviate that you're adding to tech debt.
Agreed, there is a lot I disagree with from the article, but this is the only one that my mind lashed out at as simply being WRONG. The only time I have seen numbers used on aliases in 10+ years is on self-joins. Even on self-joins I prefer to see something in the alias that identifies why that particular join is part of the query, but the reality is that such a descriptive name would often be so long that you're more likely to see a numeric suffix.
Yeah, "tbl" is redundant and so is the prefix "Member". In queries, you can use Member.id / member.name etc. for the same effect. I'd also drop the "date" because the it's repeating the type of data (probably inaccurately b/c it's most often a datetime). Sometimes I use pre- or postfix for disambiguation as in "isRetired" vs. "RetiredOn".
The primary column is always id, something like "memberID" would be some id created by an external system.
Why prefix your tables with tbl? Do you prefix your functions with "fun" and your variables with "var" and your classes with "cls"? It's completely redundant.
Some people do it to avoid clashes with reserved words.
Example: in some DBMSs, you cannot name your table 'user' because it is a reserved word.
"user" (double quotes) works, but there may be gotchas as sometimes double quotes suddenly make identifiers case-sensitive... but only if the filesystem is case-sensitive... things like that.
So you can either deal with the gotchas or say fuck it, put a prefix "tbl" and call it a day.
I'd much rather just name that table AppUser than prefix every single other table with "tbl". That's just taking one minor annoyance and multiplying it a thousand times.
In SQL server? In PostgreSQL you would wrap with double quotes. Which kinda sucks when your language uses double quotes for strings and you have to escape.
How about:
If you're reading some 50+ line SQL query with a bug, things like this are easy to read and be confident in. Honestly, I think AS should basically never be used for tables.Likewise, the guide gives the example of naming join tables 'services' instead of 'car_mechanics'. But if I see a table 'car', a table 'mechanic', and a table 'car_mechanic', I instantly know how to join against this in any direction. The hard part of understanding a schema is how it fits together, not what it represents, and when you have hundreds of tables and need to remember that join table's name, it's really, really nice if it's trivial to derive.
Lastly, the guide's recommended indentation for FROM vs JOIN statements seems off to me. Consider:
Well, the thing is, the riders table and bikes table have the exact same priority in the query, but this spacing really emphasizes the riders table and it's not clear why.