
A humble guide to database schema design - helloiloveyou
https://www.mikealche.com/software-development/a-humble-guide-to-database-schema-design
======
earthboundkid
Most of this advice is stuff they tell you in school that doesn’t matter at
all in reality.

Real advice:

\- Normalize until it hurts, denormalize until it works. (I.e. prefer normal
forms but denormal will have better performance.)

\- You need nullable dates and nullable numbers. You never need a nullable
string. If you must distinguish blank string from unset, add a column called
“unset”. Using null for this will just result in bad data entry.

\- Names should be stored as display_name, first_name, and last_name. A
display name is not the same as just first plus last. They have different
purposes, so don’t shove them together.

\- When in doubt, just have a column full of JSON and figure it out later.

\- Offset pagination is bad for performance. Avoid it at all costs unless you
want some spider indexing page 10,432 of your archives to eat all your CPU.

~~~
teddyh
> _\- Names should be stored as display_name, first_name, and last_name._

No.

“[…] _refer people to this post the next time they suggest a genius idea like
a database table with a first_name and last_name column._ ”

— [https://www.kalzumeus.com/2010/06/17/falsehoods-
programmers-...](https://www.kalzumeus.com/2010/06/17/falsehoods-programmers-
believe-about-names/)

You could, conceivably, have a "display_name" column, an "informal_name"
column, a "formal_name" column, etc. depending on your needs. But for
absolutely most people and purposes, a single "name" column will suffice, and
will steer programmers away from making unfounded assumptions about people’s
names.

~~~
Mikhail_Edoshin
You'll regret this decision as soon as you have to integrate with another
system that uses separate first and last names. Or as soon as you want to
build a sorted list of five thousand names or output a bibliography in a
specific format or anything.

Technically the right approach to the problem of storing international names
would be to store markup: a string with an associated structure and a
discriminant:

    
    
        <name lang="en">
          <first>Kenneth</first>
          <middle>Elton</middle>
          <last>Kesey</last>
        </name>
        <name lang="ru">
          <given>Fyodor</given>
          <patronymic>Mikhailovich</patronymic>
          <surname>Dostoyevsky</surname>
        </name>
        <name lang="es">
          <given>Gabriel</given>
          <family-paternal>García</family>
          <family-maternal>Márquez</family>
        </name>
    

You can then have precise automated ways to transform this into whatever
derived form you need. Using a single name field is just avoiding the problem.
Using multiple fields is a maintenance burden, because in most cases fields
are not fully independent and you have to update them all to keep the data
coherent.

~~~
pezezin
What you want is not markup, but tagged unions. It's a real shame SQL and many
other languages don't support them.

Also, your example about Spanish names doesn't take into account that nowadays
parents can choose the order of family names, and that gay marriage has been
legal for 15 years. Which reflects a deeper truth: don't make too strong
assumptions about reality, for it may change.

~~~
Mikhail_Edoshin
Tagged unions, if I understand what they are, can be done in SQL:

    
    
        create table Foo(
          fooId primary key autoincrement, 
          type integer /* tag/discriminant */ );
        create table FooA(fooId primary key,
          /* data specific to FooA */ );
        create table FooB(fooId primary key,
          /* data specific to FooB */ );
    

They're not quite markup though. Markup has a string/stream with associated
records. The number of these records, their kind, and their placement varies,
so it can handle ordering issues or multiple similar names or missing name
(e.g. missing middle name), or ignore parts of name quite naturally.

~~~
naasking
Yes, you can simulate them, but it's painful.

~~~
no-s
It's not a simulation, it's a data model for reducing the pain for what's
really a hard problem - names we call people are not unique, they are
colloquial identifiers (HHOS).

~~~
naasking
Joined tables on the same primary key are a simulation of tagged unions.
Proper tagged unions would solve more concisely and with more clarity.

------
tasuki
> Never store firstname and lastname in the same field

I disagree. Why would you want to force the "firstname lastname" pattern on
people? What if someone has three names? Or only one? Just use a "name" field!

~~~
CameronNemo
You seem to be assuming that each field:

1\. Could not contain whitespace.

2\. Would have a minimum length.

TFA specifies neither of these requirements. You could have someone's first
name (which I would recommend calling given name) be "Pablo Emilio" and their
surname be "Escobar Gaviria". Or alternatively you could have someone with a
given name of "Sukarno" and an empty string for the family name. Patronyms and
matronyms are sort of ambiguous, but would probably be categorized as
surnames.

I use a single name field, but I don't think a given name and surname split is
crazy.

~~~
ozim
Wait till you get into some Dutch surnames "van der Kleij", "van den Bosch",
"de Jager" which are not patronyms. They would still fit into "first name,
last name" pattern. Because you assume what is last name can have
[https://en.wikipedia.org/wiki/Tussenvoegsel](https://en.wikipedia.org/wiki/Tussenvoegsel)

Though a lot of pages in NL have separate field for it.

~~~
earthboundkid
Arabic culture will sometimes adopt a name after you have a child as “father
of X”.
[https://en.wikipedia.org/wiki/Kunya_(Arabic)](https://en.wikipedia.org/wiki/Kunya_\(Arabic\))

~~~
DeathArrow
Yes, Arabs use "Ibn" and jews use "Ben". Slavs use father's name and "ich"
suffix. People in Iceland use different kind of patronimics.

You can either use just one name field if your logic doesn't need a name or
surname or provide two fields for name and surname and let the user handle the
splitting.

I have three first names. When I fill a form, if it's an official form or if
it's for business purposes I fill all my names as first name, if not, I only
write the name I use.

~~~
earthboundkid
Ibn is a patronym, but a kunya is the other way: your name is the name of your
child. E.g. Palestine’s Abbas is also “Abu Mazen”, the father of Mazen.

------
ebg13
Design guidance aside, this is not a good article because it doesn't show how
to fix the presented problem scenarios.

"This is not first normal form" \- OK, how would you fix it?

"This is not second normal form" \- OK, how would you fix it?

"This is neither second nor third normal form" \- OK, how would you fix it?

If you're going to show problems, show solutions as well.

The rest is a big red flag list of falsehoods that programmers believe about
names and addresses.

~~~
helloiloveyou
Although others have replied to you, I'll take your comment as constructive
criticism. Though the tone could be better.

------
tomlagier
If you're dealing with international addresses, point 3 becomes very
challenging. The 'tokens' of an address are called different things
everywhere, take different forms, and sometimes don't make much sense to
compare. Figuring out a good balance of usability and generality can be really
tricky.

Here's nearly 100 pages on the subject (from the perspective of addressing
mail for USPS):
[http://www.columbia.edu/~fdc/postal/](http://www.columbia.edu/~fdc/postal/)

~~~
crgwbr
I’ve been through this before: storing a couple addresses for a few hundred
different countries in an otherwise very normalized RDBMS. We couldn’t figure
out anyway to do it other than EAV tables and a defined attribute pattern for
each country.

Even though that project is long in the past for me, I’m really curious how
other people do this. Anyone come up with something better?

~~~
Benjammer
At that point you might as well just make an "Addresses" or "Address" table
with a raw text field for the address. At the end of the day this thing is
something that a human will read and map to the real world location.

~~~
uk_programmer
> At the end of the day this thing is something that a human will read and map
> to the real world location.

That is quite an assumption.

I personally would use a field with a structured format e.g. XML (most RDBMS
systems can deal with XML easily) and different types of address field e.g.
postcode, zip code etc.

Then you could use a strategy for each country when dealing with it in your
application.

------
me_bx
Off-topic comment / feedback about the blog post:

The author self-describes himself as "_Trusted_ Consultant" in the blog's
subtitle.

The comic strips illustrating the article do not mention about the permission
from identified license provider(s), and are cropped, does not help building
trust - at least in my case.

Are they used with authorization from their authors?

------
heisenzombie
This person would disagree about point 4:
[https://www.kalzumeus.com/2010/06/17/falsehoods-
programmers-...](https://www.kalzumeus.com/2010/06/17/falsehoods-programmers-
believe-about-names/)

~~~
helloiloveyou
Author here! Ha, I'm a big follower of Patrick, specially his microconfs! I
didn't know he had an article an article on the subject, thanks for linking
it!

However I agree with him!

What happens is that in my experience I've yet to encouter a case where what I
suggest in the blog doesn't work. No doubt that there will be cases where it
will not be enough. But by reading Patrick's article, clearly the complexity
of the system needed to handle all the cases would be huge! So I guess one has
to compromise :)

~~~
jbotz
Yes, compromise in that typically US-centric way, and to hell with any culture
that doesn't fit the mold, they can adapt!

It's true, they can adapt! They proved that when they didn't have much choice,
back in the ASCII days when you had to conform or stay in the pre-computer
age. But maybe we should try to better now that we can...

------
boffinism
In what way is this guide 'humble'? Seems fairly self-confidently opinionated
to me.

~~~
redis_mlc
I'd agree. The advice to use NULL is usually harmful, and constraints are not
high-performance.

One nitpick is that "user" is a reserved word in PG, so "users" is better.

Other than that, I'd agree with most suggestions. But I'd use MySQL. :)

Source: DBA.

~~~
kbenson
> The advice to use NULL is usually harmful

The advice is to set NOT NULL if you think the field can't be null. That's
valid, and a matter of encoding the schema to correctly match the data. Same
with check values. Keep bad data out of the database wherever you can.

> and constraints are not high-performance

Then turn off constraint checking after you've built your schema and
identified them as a performance problem. Don't optimize away your data
integrity before you've even determined it's a problem.

If you actually don't care about your schema, just use a text field and throw
whatever in it, or use a document store without any strict checking. If you're
using SQL, presumably you're using it for a reason, and that's likely so you
can take advantage of what it has to offer.

~~~
helloiloveyou
Hey thanks for clarifying it for the other user! I also agree with everything
that you're saying, in this and other comments in the thread

------
awaythrower
Firstname lastname is too fragile and oversimplified. Suffix, prefix and a
variable number of middle names/initials can also be necessary. Occasionally,
legal name vs. nickname is also needed. But if there's no pressing need for
grouping or sorting by surname, why piecemeal any of it out?

------
LoSboccacc
> Never store firstname and lastname in the same field

funny, where would you store the middle name then? and the second middle name?
married name in cultures where the spouse changes or double it? of course in
normal form you need a name table, with an order column, and so on and so
long.

anyway, this article goes straight to the notions everyone can fetch on any
book and skip the interesting and diffiylcult part about database design,
which is understanding your domain so that no future expansion is prevented by
a wrong cardinality on a relation that was simplified from reality.

because if it's a user database you might get away with just the name, so you
can address people during support calls. if it's an accounting database, you
might want to know someone name at the time of invoicing, having multiple
records with start and end validity times, and if it's a FBI database for
protected identities you might have to have multiple name aliases referring to
a single person.

~~~
helloiloveyou
I'd store the middle name in the first_name column. Like for example "Juan
Martin" would be in the first_name columns and "Del Potro" in the last_name.
Apart from the middle/first name incongruence that you may highlight, I
believe that this design is not awfully bad and should serve in most
situations.

~~~
jimktrains2
If you used that as an address in emails it'd sound very awkward to include a
middle name when someone doesn't usually go by it.

------
atwebb
Since everyone has to have an opinion and provide it, here's mine, maybe a
retitle?

>A humble guide to transactional database schema design

There are a number of things that don't apply to analytical database design
and I'm seeing the two confused fairly often, then again, this comes entirely
from my point of view so take it for what it's worth.

------
davedx
I work on an e-commerce platform that's live in about 20 countries. One of the
hardest records to normalize correctly is the address, because not only do
users have very different rules depending on their country (e.g. not all
addresses in Turkey have post codes, GB has some very weird building
names/numbers...), we also connect with vendor API's that have their _own_
ideas of how to model an address.

My personal opinion on this is something like address_line_1, address_line_2,
city, region, post_code, with each field being varchar(max). Anyone else have
an opinion on how to do this?

~~~
mojzu
After spending a long time trying to figure this out I came to the opposite
conclusion of the article. Unless you're only operating in one/a small number
of countries trying to normalise addresses is too much of a moving target.

The schema I ended up with for my use case is: country (required), region
(optional depending on country), address (rest of address text) and a lookup
to get geographical coordinates for the address (with user confirmation to
tell if it's somewhat accurate). Filtering/sorting on country and region then
gets you pretty far and for the more complicated queries I found it took less
time to use the coordinates with PostGIS than trying to match up bits of the
address text.

~~~
davedx
Yeah, I guess it really depends on your use case and your overall system too.
Our system connects to multiple vendor API's, so having one field for the rest
of the address text would be very difficult -- we'd have to try and
automatically split it into constituent parts depending on the requirements of
vendors. Some of our data then goes to credit check API's too, so an incorrect
address can result in an (incorrectly) lost order for us.

It's a challenging area all round.

------
ghufran_syed
“Beginning database design” By Clare Churcher is a really accessible book on
these issues, very practical (starts off with simple spreadsheet tables!) and
takes you through step by step to understanding the problems with and the need
for the different levels of normalization.

------
winrid
Recently I ran into a project where everything is stored in around 3k unique
fields, with many levels of nesting, in a JSONB column with NO schema or
documentation. Generated JSDOC from the data and I'm hoping to create some
kind of schema.

You can use JS on the server. Just don't do it this way ...

------
gigatexal
I’m just here with popcorn and for the comments.

Lucky for me SQL just clicked and when I wrote my first schema third normal
form is how it came out more or less so I knew I was on to something.

I would say a study of how the internals of your databases of choice works
would help in schema design.

~~~
BlargMcLarg
How I wish for nothing more than an alternative to SQL to communicate with
databases, something that automatically infers coupling tables too. Set theory
is nice. SQL.. personally, hard pass.

~~~
gigatexal
What parts do you find difficult? I agree that the benefit to having something
learn the schema and the relationships would be nice but I get that through
something like DBeaver already.

------
DeathArrow
That's one issue I see with NoSql, data is not normalized and consistency goes
out of the window if database is big enough and the volume of data is big
enough. That might not matter if you work on a search engine, social network,
music suggestion app, but for most business apps and for most websites, having
inconsistency is really bad.

On the other hand, having some data denormalized can improve performance as
you can avoid expensive joins. It's for designer to see if they need to trade
consistency for performance and in what measure. I'd avoid it, I'd prefer to
add more servers, use load balancing, use caches.

------
jonathanstrange
In my programming life I've barely used SQL databases so far, and there is
something I've never understood about these recommendations. Maybe someone
here can explain that to me.

I thought databases are designed to organize the data optimally internally and
execute all operations on them as efficiently as possible. Why do you then
need to organize databases so stringently? Doesn't that mean that there is
something wrong with how the db works?

To me the advice always seemed similar to bit twiddling instead of relying on
the compiler to optimize. Shouldn't the database lay out the data optimally
internally?

~~~
rntksi
I'm not here to say the author is right. S/he is making a lot of assumptions,
and as you can see in this thread, everyone is saying it depends.

But one fact is that reducing redundancy in a database is a GOOD thing.

To make a comparison, I'd say it's like when I have to maintain a codebase
that is large, and I have to pass a constant (let's say, ERR_STATUS = 1)
around in some functions in various folders.

I wouldn't want to have to define it in every single file. One would want that
in a header, in a library or in a file that defines static values somewhere.
Anywhere but only one location for it.

Database is a bit like a dumping ground. Once the business gets big enough, a
lot of other application are going to plunge in and do things with data. I
wouldn't want to have inconsistent data. As a programmer, if one had
inconsistent things in one's source code, one could deal with it. But data is
different, data is produced by users and interactions and so on. Once you
begin to have inconsistent data, you can't trust it anymore until you find out
exactly what did it. A lot of wasted time. And that leads to a lot of
problems.

But I agree with you about your point on performance. However by organising
data correctly, having indexes on things that matter, joining in the right
way, one's using business knowledge and programming knowledge to optimise the
database. And that's not something that can be done 100% on its own (as of
now).

------
collyw
My humble opinion is that poor choices in database design are the source of a
great deal of technical debt.

I used to avoid making schema changes as they were "scary" (until I got a
decent understanding of Django Migrations). Just a quick hack to make up for
it in the application layer was far easier. Then another, and another until
you have real mess on your hands.

And avoid nulls when you can. Unfortunately most of the time I have had a new
database design there is existing data in excel sheets that is messy and you
can't avoid it.

~~~
Aeolun
We have so many tables with is_active and active_col in the same table...

------
FpUser
I've got 2 things to say:

1) There is no "good design". It is always a compromise of trying to satisfy
various practical constraints. OLAP vs OLTP probably rings a bell here. And
this Last / First name thing is laughable. The complexity of handling all
possible human names can be insane. Depending on the goal the proper design
can be anything starting from single field and up to some relational
structure.

2) The article itself looks like let's just write something up to score some
points.

------
yellowapple
> 4) Never store firstname and lastname in the same field

I have the opposite opinion: unless you have a specific reason why you need to
know one's "first name" and "last name" (e.g. to interface with some other
system that requires names to be split up like that), just use a single field
and stop caring about whether or not the name conforms to your preconceived
notions of format and structure.

------
emmelaich
(fwiw, I like the article, and I find it suitably humble)

Just want to hang a thought here ... depending on the precise definition of
First Normal Form, it is either impossible to avoid[0] or impossible to
enforce[1] without further constraints.

[0] if definition is "must be rectangular", i.e. all columns have the same
number of rows and all rows have the same number of columns.

[1] if the definition includes "do not allow duplicate rows"

------
reader_1000
The Data Model Resource Book [1] is a good resource how to design database. It
is a kind of reference book

[1] [https://www.wiley.com/en-
us/The+Data+Model+Resource+Book%2C+...](https://www.wiley.com/en-
us/The+Data+Model+Resource+Book%2C+Volume+1%3A+A+Library+of+Universal+Data+Models+for+All+Enterprises%2C+Revised+Edition-p-9780471380238)

------
rukuu001
This article is perfect if you're just getting started.

There's lore around design tricks to get the most out of a database, but it's
often implementation or domain specific.

You'll learn what rules to break, and why, as you solve real life production
problems.

FWIW - you can go a long, long, long way before de-normalising is actually
necessary.

------
DeathArrow
I wish there will be a DBMS which satisfy all aspects of CAP theorem and also
provide high performance.

Until then, we are stuck with RDBMS and NoSql as lesser evils based on the use
cases.

~~~
zepolen
NoSQL doesn't magically solve the CAP theorum in my experience it actually
makes it worse.

------
paulmendoza
Make the primary key the thing you query by. Treat a sql database as similar
to a nosql database (no joins) and your performance will be amazing.

------
SPBS
RE: not storing address in a single field, splitting an address into semantic
fields feels like a premature optimisation. You’re forced to anticipate all
the fields that you may want to aggregate by, before you may actually even use
them.

Instead, you could just create a custom function that takes in a string and
tries to identify the city or street number or whatever element you want
extract from the address e.g.

    
    
        select * from users where extract_city(address) = 'london';

------
paulmendoza
Use the array data type for simple collections if using Postgres because it
saves a join.

------
DeathArrow
An advice I think is worth adding to the list is: use indexing.

------
banq
This a DDD aggregates design guide, not database schema

------
darth_avocado
Use third normal form. - Till your system gets 10 year old and linking your
user's first name to their address requires joins on 10 tables having millions
of rows.

